Preface



The International Conference TSD 2001, the fourth event in the series on Text, Speech, 
and Dialogue, which originated in 1998, presents state-of-the-art technology and recent 
achievements in the field of natural language processing. This year, TSD 2001 is
organized as a full-fledged conference with all the components of such an event:
invited talks by top-class researchers and plenary and problem-oriented sessions
are preceded by four tutorials revolving around all problem areas of human-machine
interaction.

The conference declares its intent to be an interdisciplinary forum, which intertwines 
research in speech and language processing as well as research in the Eastern and Western 
hemispheres. We feel that the mixture of different approaches and applications gives 
a great opportunity to get acquainted with the current activities in all aspects of language 
communication and to witness the amazing vitality of researchers from the former Eastern
Bloc countries. The financial support of ISCA (International Speech Communication
Association) enables the wide attendance of researchers from all active regions of the 
world. 

This book contains a collection of all the papers presented at the international conference 
organized by the Faculty of Applied Sciences of the University of West Bohemia in Pilsen
in collaboration with the Faculty of Informatics, Masaryk University in Brno, and held
in the beautiful setting of Zelezna Ruda (Czech Republic), September 11-13, 2001. A
total of 59 accepted papers out of 117 submitted, contributed by 122 authors (47 from
Central Europe, 11 from Eastern Europe, 54 from Western Europe, 5 from America, and
5 from Asia) are included in these conference proceedings.

We would like to gratefully thank the invited speakers and the authors of the papers for 
their valuable contributions, the ISCA for its financial support, and Prof. Jezek, Dean of 
the Faculty of Applied Sciences, for greeting the conference on behalf of the University
of West Bohemia. 

Last but not least, we would like to express our gratitude to the authors for their efforts
to provide the papers on time, to members of the Program Committee for their careful
reviews, to the editors for their hard work in preparing these proceedings, and to the 
members of the Local Organizing Committee for their enthusiasm in organizing the 
conference. 
Abstract. Two kinds of systems have been defined during the long history of
WSD: principled systems that define which knowledge types are useful for WSD,
and robust systems that use the information sources at hand, such as dictionaries,
light-weight ontologies or hand-tagged corpora. This paper tries to systematize
the relation between desired knowledge types and actual information sources. We
also compare the results for a wide range of algorithms that have been evaluated
on a common test setting in our research group. We hope that this analysis will
help the shift from systems based on information sources to systems based
on knowledge sources. This study might also shed some light on semi-automatic
acquisition of desired knowledge types from existing resources. 



1 Introduction 

Research in Word Sense Disambiguation (WSD) has a long history, as long as Machine 
Translation. A vast range of approaches has been pursued, but none has been successful 
enough in real-world applications. The last wave of systems using machine learning 
techniques on hand-tagged corpora seems to have reached its highest point, far from the 
expectations raised in the past. The time has come to reflect on the gap between
principled systems that have deep and rich hand-built knowledge (usually a Lexical
Knowledge Base, LKB), and robust systems that use either superficial or isolated
information.

Principled systems attempt to describe the desired kinds of knowledge and proper 
methods to combine them. In contrast, robust systems tend to use whatever lexical
resource they have at hand, either Machine Readable Dictionaries (MRD) or light-weight
ontologies. An alternative approach consists of hand-tagging word occurrences in
corpora and training machine learning methods on them. Moreover, systems that use corpora
without the need for hand-tagging have also been proposed. In any case, little effort has
been made to systematize and analyze what kinds of knowledge have been put into 
play. We say that robust systems use information sources, and principled systems use 
knowledge types. 

Another issue is the performance that one can expect from each information source 
or knowledge type used. Little comparison has been made, especially for knowledge
types, as each research team tends to evaluate its system on a different experimental 
setting. The SENSEVAL competition [1] could be used to rank the knowledge types 
separately, but unfortunately, the systems tend to combine a variety of heuristics without 
separate evaluation. We tried to evaluate the contribution of each information source 
and knowledge type separately, testing each system in a common setting: the English
sense inventory from WordNet 1.6 [2], and a test set comprising either all occurrences 
of 8 nouns in Semcor [3] or all nouns occurring in a set of 4 random files from Semcor. 

This paper is a first attempt to systematize the relation between desired knowledge 
types and actual information sources, and to provide an empirical comparison of results 
on a common experimental setting. In particular, a broad range of systems that the authors 
have implemented is analyzed in context. 

Although this paper should review a more comprehensive list of references, space
constraints allow only a few relevant ones. For the same reason, the algorithms
are just sketched (the interested reader always has a pointer to a published reference),
and the results are given as averages.

The structure of the paper is as follows. Section 2 reviews traditional knowledge 
types useful for WSD. Section 3 introduces an analysis of the information sources for 
a number of actual systems. Section 4 presents the algorithms implemented, together 
with the results obtained. Section 5 presents a discussion of the results, including future 
research directions. Finally, Section 6 draws some conclusions.



2 Knowledge Types Useful for WSD 

We classify the knowledge types useful for disambiguating an occurrence of a word 
based on Hirst [4], McRoy [5] and our own contributions. The list is numbered for 
future reference. 

1. Part of speech (POS) is used to organize the word senses. For instance, in WordNet 
1.6 handle has 5 senses as a verb, only one as a noun. 

2. Morphology, especially the relation between derived words and their roots. For
instance, the noun agreement has 6 senses, its verbal root agree 7, but not all
combinations hold.

3. Collocations. The 9-way ambiguous noun match has only one possible sense in 
“football match”. 

4. Semantic word associations, which can be further classified as follows:

(a) Taxonomical organization, e.g. the association between chair and furniture.

(b) Situation, such as the association between chair and waiter.

(c) Topic, as between bat and baseball.

(d) Argument-head relation, e.g. dog and bite in “the dog bit the postman”.
These associations, if given as a sense-to-word relation, are strong indicators for 
a sense. For instance, in “The chair and the table were missing” the shared class in 
the taxonomy with table can be used to choose the furniture sense of chair. 

5. Syntactic cues. Subcategorization information is also useful, e.g. eat in the “take 
a meal” sense is intransitive, but it is transitive in other senses. 

6. Semantic roles. In “The bad news will eat him” the object of eat fills the experiencer
role, and this fact can be used to better constrain the possible senses for eat. 

7. Selectional preferences. For instance, eat in the “take a meal” sense prefers humans 
as subjects. This knowledge type is similar to the argument-head relation (4d), but 
selectional preferences are given in terms of semantic classes, rather than plain words.
8. Domain. For example, in the domain of sports, the “tennis racket” sense of racket
is preferred. 

9. Frequency of senses. Out of the 4 senses of people the general sense accounts for 
90 % of the occurrences in Semcor. 

10. Pragmatics. In some cases, full-fledged reasoning has to come into play to
disambiguate head as a nail-head in the now classical utterance “Nadia swung the hammer
at the nail, and the head flew off” [4].

Some of these types are out of the scope of this paper: POS tagging is usually per- 
formed in an independent process and derivational morphology would be useful only 
for disambiguating roots. 

Traditionally, the lexical knowledge bases (LKBs) containing the desired knowledge 
have been built basically by hand. McRoy, for instance, organizes the knowledge related 
to 10,000 lemmas in four inter-related components: 

1. lexicon: core lexicon (capturing knowledge types 1, 2, 5 and 9) and dynamic lexicons
(knowledge type 8) 

2. concept hierarchy (including 4a, 6 and 7) 

3. collocational patterns (3) 

4. clusters of related definitions (sets of clusters for 4b and 4c) 

Manual construction of deep and rich semantic LKBs is a titanic task, with many short- 
comings. It would be interesting to build the knowledge needed by McRoy’s system 
by semi-automatic means. From this perspective, the systematization presented in this 
paper can be also understood as a planning step towards the semi-automatic acquisition 
of such semantic lexicons. 



3 Information Sources Used in Actual Systems 

WSD systems can be characterized by the information source they use in their algo- 
rithms, namely MRDs, light-weight ontologies, corpora, or a combination of them [6]. 
This section reviews some of the major contributors to WSD (including our implementa- 
tions). The following section presents in more detail the algorithms that we implemented 
and tested on the common setting. The systems are organized according to the major 
information source used, making reference to the knowledge types involved. 

MRDs (4, 5, 7, 9). The first sense in dictionaries can be used as an indication of the 
most used sense (9). Other systems [7] [8] try to model semantic word associations
(4) by processing the text in the definitions in a variety of ways. Besides, [7] uses the
additional information present in the machine-readable version of the LDOCE dictionary, 
subject codes (4a), subcategorization information (5) and basic selectional preferences 
(7). Unfortunately, other MRDs lack this latter kind of information. 

Ontologies (4a). Excluding a few systems using proprietary ontologies, most systems 
have WordNet [2] as the basic ontology. Synonymy and taxonomy in WordNet provide 
the taxonomical organization (4a) used in semantic relatedness measures [9] [10]. 

Corpora (3, 4b, 4c, 4d, 5). Hand-tagged corpora have been used to train machine learning
algorithms. The training data is processed to extract features, that is, cues in the context of 
the occurrence that could lead to disambiguate the word correctly. For instance, Yarowsky
[11] showed how collocations (3) could be captured using bigrams and argument-head 
relations. In the literature, easily extracted features are preferred, avoiding high levels of 
linguistic processing [11] [12] [13] [14]. In general, two sets of features are distinguished: 

1. Local features, which use local dependencies (adjacency, small window, and limited 
forms of argument-head relations) around the target sense. The values of the features 
can be word forms, lemmas or POS. This set of features tries to use the following 
knowledge types without recognizing them explicitly: collocations (3), argument- 
head relations (4d) and a limited form of syntactic cues (5), such as adjacent POS. 

2. Global features consist of bags of lemmas in a large window (50, 100 words wide)
around the target word senses. Words that co-occur frequently with the sense would
indicate that there is a semantic association of some kind (usually related to the
situation or topic, 4b, 4c).

During testing, a machine learning algorithm is used to compare the features extracted 
from the training data to the actual features in the occurrence to be disambiguated. The 
sense with the best matching features is selected accordingly. 

MRD and ontology combinations (4a, 4b, 4c) have been used to compensate for the 
lack of semantic associations in existing ontologies like WordNet. For instance, [15] 
combines the use of taxonomies and the definitions in WordNet yielding a similarity 
measure for nominal and verbal concepts which are otherwise unrelated in WordNet. 
The taxonomy provides knowledge type 4a, and the definitions implicitly provide 4b 
and 4c. 

MRD and corpora combinations (3, 4b, 4c, 4d, 5). [16] uses the hierarchical orga- 
nization in Roget’s thesaurus to automatically produce sets of salient words for each 
semantic class. These salient words are similar to McRoy’s clusters [5], and could cap- 
ture both situation and topic clusters (4b, 4c). In [17], seed words from an MRD are used
to bootstrap a training set without the need for hand-tagging (all knowledge types used
for corpora could be applied here, 3, 4b, 4c, 4d, 5). 

Ontology and corpora combinations (3, 4b, 4c, 4d, 5, 7). In an exception to the general 
rule, selectional preferences (7) have been semi-automatically extracted and explicitly 
applied to WSD [9] [18]. The automatic extraction uses parsed corpora to construct
sets of, e.g., nouns that are subjects of a specific verb, and a similarity measure
based on a taxonomy is then used to generalize the sets of nouns to semantic
classes. In a different approach [14], the information in WordNet has been used to build
automatically a training corpus from the web (thus involving knowledge types 3, 4b, 4c, 
4d, 5). A similar technique has been used to build topic signatures, which try to give 
lists of words topically associated for each concept [19]. 

For some of the knowledge types we could not find implemented systems. We are not 
aware of any system using semantic roles, or pragmatics, and domain information is 
seldom used. 
4 Experimental Setting and Implementation of Main Algorithms

A variety of the WSD algorithms using the information sources mentioned in the previous 
section have been implemented in our research team. These algorithms are presented 
below, with a summary in Table 1, but first, the experimental setting will be introduced.

The algorithms were tested in heterogeneous settings: different languages, sense 
inventories, POS considered, training and test sets. For instance, a set of MRD-based 
algorithms were used to disambiguate salient words in dictionary definitions for all
POS in a Basque dictionary. Nevertheless, we tried to keep a common experimental
setting over the years: the English sense inventory taken from WordNet 1.6 [2], and a
test set comprising either all occurrences of 8 nouns (account, age, church, duty, head, 
interest, member and people) in Semcor [3] or all polysemous nouns occurring in a set 
of 4 random files from Semcor (br-a01, br-b20, br-j09 and br-r05). Some algorithms
have been tried on the set of 8 nouns, others on all nouns in the 4 files, others on both. Two
algorithms have been tested on Semcor 1.4, but the results are roughly similar (Table 1 
shows the random baselines for both versions of WordNet). 

It has to be noted that WordNet plays both the roles of the ontology (e.g. offering 
taxonomical structure) and the MRD (e.g. giving textual definitions for concepts). 

4.1 Algorithms Based on MRD 

In [8] we present a set of heuristics that can be used alone or in combination. Basi- 
cally, these heuristics use the definitions in a Spanish and a French MRD in order to 
sense-disambiguate the genus terms in the definitions themselves. Some of the sim- 
plest techniques have been more recently tried on the common test set and can thus be 
compared. These heuristics are the following: 

Main sense. The first sense of the dictionaries is usually the most salient. This fact can 
be used to approximate the most frequent sense (MFS, 9). In our implementation, the 
word senses in WordNet are ordered according to frequency in Semcor, and thus the first 
sense corresponds to the MFS. The figures are taken from [14] [19]. Table 1 shows the
results for both the 8-noun and 4-file settings. The MFS can be viewed as the simplest 
learning technique, and it constitutes a lower bound for algorithms that use hand-tagged 
corpora. 
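
As an illustration, the MFS heuristic can be sketched in a few lines. This is a minimal sketch and not the implementation evaluated in Table 1: it assumes a toy list of hand-tagged (lemma, sense) pairs standing in for Semcor, and the function names and sense labels are hypothetical.

```python
from collections import Counter, defaultdict

def train_mfs(tagged_occurrences):
    """Count senses per lemma and return the most frequent sense of each lemma."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_occurrences:
        counts[lemma][sense] += 1
    # most_common(1) returns [(sense, count)] for the top sense
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

# Toy hand-tagged data (hypothetical sense labels, not taken from Semcor)
corpus = [("people", "people#1"), ("people", "people#1"),
          ("people", "people#2"), ("head", "head#1")]
mfs = train_mfs(corpus)

def disambiguate_mfs(lemma):
    """Always answer with the most frequent sense; None if the lemma is unseen."""
    return mfs.get(lemma)
```

Because the answer ignores the context entirely, any method trained on the same tagged data would be expected to do at least as well.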

Definition overlap. In its simplest form, the overlap between the definitions for
the word senses of the target word and the words in the surrounding context is used
[19]. This is a very limited form of knowledge type 4, but its precision (cf. Table 1 for 
results on the 8-noun set) is nevertheless halfway between the random baseline and the 
MFS. We have also implemented more sophisticated ways of using the definitions, as 
co-occurrences, co-occurrence vectors and semantic vectors [8], but the results are not 
available for the common experimental setting. 
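
The simplest form of definition overlap can be sketched as follows. This is a hedged illustration only: the stopword list, the tokenization, the tie-breaking rule and the glosses are assumptions made for the example and are not taken from [19] or from WordNet.

```python
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "or", "that", "is"}

def content_words(text):
    """Lowercase, strip punctuation, split on whitespace and drop stopwords."""
    return {w.strip(".,;:()\"'").lower() for w in text.split()} - STOPWORDS - {""}

def definition_overlap(context, sense_definitions):
    """Return the sense whose definition shares most words with the context.

    sense_definitions maps a sense label to its dictionary gloss.
    Ties are broken in favour of the sense listed first.
    """
    ctx = content_words(context)
    best_sense, best_score = None, -1
    for sense, gloss in sense_definitions.items():
        score = len(ctx & content_words(gloss))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical glosses, written for illustration only
senses = {
    "bank#1": "a financial institution that accepts deposits of money",
    "bank#2": "sloping land beside a river or lake",
}
choice = definition_overlap("He sat on the bank of the river fishing", senses)
```

Here only the second gloss shares a content word ("river") with the context, so `choice` is `"bank#2"`.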

4.2 Algorithms Based on Ontologies 

Conceptual density [10] [20] is a measure of concept-relatedness based on taxonomies 
that formalizes the common semantic class (knowledge type 4a). The implementation 
for WordNet was tested on the 4-file test set using WordNet version 1.4. 
Table 1. Summary of knowledge sources and results of algorithms implemented. The first column
shows the information source, the second the specific information used and the related knowledge
types and the third the algorithm used. Finally evaluation is given for the two test sets using
precision (correct answers over all examples) and coverage (all answers over all examples). Note:
global context here (marked *) includes the combination of local and global context.


                                                             Results (prec./cov.)
Information source  Knowledge types          Algorithm       8 nouns       4 files
                                                             prec.  cov.   prec.  cov.

Random baseline     -                        WordNet 1.4     -      -      .30    1.0
                                             WordNet 1.6     .19    1.0    .28    1.0
MRD                 Main sense 9                             .69    1.0    .66    1.0
                    Definition 4             Overlap         .42    1.0    -      -
Ontology            Hierarchy 4a             Conc. Density   -      -      .43    .80
Corpora             Most freq. sense 9                       .69    1.0    .66    1.0
                    Local context 3 4d 5     Decision lists  .78    .96    *      *
                    Syntactic cues 5         "               .70    .92    -      -
                    Arg.-head relations 4d   "               .78    .69    -      -
                    Global context 4b 4c     "               .81    .87    .69*   .94*
MRD + Corpora       Semantic classes 4b 4c   Mutual info.    -      -      .41    1.0
Ontology + Corpora  Selectional pref. 7      Probability     .63    .33    .65    .31
                    Topic signatures 4b 4c   Chi^2           .29    .99    -      -
                    Aut. tagged corp. 3 4 5  Decision lists  .13    .71    -      -


4.3 Algorithms Based on Corpora 

Hand-tagged corpora have been used to train machine learning algorithms. We are
particularly interested in the features used, that is, the different knowledge sources used by
each system [11] [12] [13] [14]. We have tested a comprehensive set of features [14]
[21] which for the sake of this paper we organized as follows:

- Local context features comprise bigrams and trigrams of POS, lemmas and word 
forms, as well as a bag of the words and lemmas in a small window comprising 
4 words around the target [14]. These simple features involve knowledge about 
collocations (3), argument-head relations (4d) and limited syntactic cues (5). 

- In addition to the basic feature set, syntactic dependencies were extracted to try to 
model better syntactic cues (5) and argument-head relations (4d) [21]. The results 
for both are given separately in Table 1.

- The only global feature in this experiment is a bag of words for the words in the
sentence (knowledge types 4b, 4c) [14]. 
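
The local and global feature templates listed above can be sketched as follows. This is only an illustration under stated assumptions: it works on word forms alone (the same templates would apply to lemmas and POS), it reads the 4-word window as two words on each side of the target, and the feature-name prefixes are hypothetical.

```python
def extract_features(tokens, i, window=4):
    """Build string-valued features for the target token at position i.

    tokens is a list of word forms for one sentence. Only word forms are
    used here; the same templates could be applied to lemmas or POS tags.
    """
    feats = set()
    pad = ["<s>"] * 2 + tokens + ["</s>"] * 2
    j = i + 2  # position of the target in the padded list
    # Local bigrams and trigrams that include the target position
    feats.add("big_left=" + " ".join(pad[j - 1:j + 1]))
    feats.add("big_right=" + " ".join(pad[j:j + 2]))
    feats.add("trig_left=" + " ".join(pad[j - 2:j + 1]))
    feats.add("trig_mid=" + " ".join(pad[j - 1:j + 2]))
    feats.add("trig_right=" + " ".join(pad[j:j + 3]))
    # Bag of words in a small window (window // 2 words on each side)
    half = window // 2
    for w in tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]:
        feats.add("win=" + w)
    # Global feature: bag of all other words in the sentence
    for k, w in enumerate(tokens):
        if k != i:
            feats.add("sent=" + w)
    return feats

sentence = ["the", "old", "church", "bell", "rang", "twice"]
features = extract_features(sentence, 2)
```

For the target "church", the word "twice" falls outside the small window and therefore appears only as a sentence-level feature.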

We chose to use one of the simplest yet most effective ways to combine the features: decision
lists [22]. The decision list orders the features according to their log-likelihood, and the 
first feature that is applicable to the test occurrence yields the chosen sense. In order to 
use all the available data, we used 10-fold cross-validation. Table 1 shows the results 
in the 8-noun set for each of the feature types. In the case of the 4-file set, only the 
combined result of local and global features is given. 
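
A decision list of this kind can be sketched as follows. The sketch assumes that each feature-sense pair is weighted by the log of the ratio between the count of that sense and the count of all other senses for the feature, with a small additive smoothing constant; the smoothing value, the function names and the toy training data are assumptions of this example, not details taken from [22].

```python
import math
from collections import Counter, defaultdict

def train_decision_list(examples, alpha=0.1):
    """examples: list of (feature_set, sense). Returns rules sorted by weight.

    Each rule is (weight, feature, sense), where weight is the smoothed
    log-likelihood ratio of the sense against all other senses given the feature.
    alpha is an additive smoothing constant (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    for feats, sense in examples:
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, by_sense in counts.items():
        total = sum(by_sense.values())
        for sense, n in by_sense.items():
            weight = math.log((n + alpha) / (total - n + alpha))
            rules.append((weight, f, sense))
    rules.sort(key=lambda r: r[0], reverse=True)
    return rules

def classify(rules, feats, default=None):
    """Return the sense of the highest-ranked rule whose feature is present."""
    for _weight, f, sense in rules:
        if f in feats:
            return sense
    return default

# Toy training data with hypothetical feature names and sense labels
train = [
    ({"win=bell", "sent=rang"}, "church#building"),
    ({"win=bell", "sent=tower"}, "church#building"),
    ({"win=catholic", "sent=rang"}, "church#institution"),
    ({"win=catholic", "sent=pope"}, "church#institution"),
]
rules = train_decision_list(train)
answer = classify(rules, {"win=bell", "sent=pope"})
```

In this toy case the rule for "win=bell" outranks the one for "sent=pope", so the building sense is chosen. When no rule applies, `classify` returns `default`, which corresponds to leaving the occurrence unanswered.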




Knowledge Sources for Word Sense Disambiguation 



4.4 Algorithms Based on a Combination of MRD and Corpora 

In the literature, there is a variety of ways to combine the knowledge in MRDs with 
corpora. We implemented a system that combined broad semantic classes with corpora 
[16] [20], and disambiguated at a coarse-grained level (implicitly covering knowledge 
types 4b and 4c). It was trained using the semantic files in WordNet 1.4 and tested 
on the 4-file setting only. In order to compare the performance of this algorithm, which
returns coarse-grained senses, with the rest, we have estimated the fine-grained precision
by choosing one of the applicable fine-grained senses at random. The results are worse
than reported in [16], but it has to be noted that the organization in lexical files is very 
different from the one in Roget’s. 
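
One way to read this estimate is as an expected value: if the predicted coarse class is correct and contains k fine-grained senses of the word, a uniform random choice is right with probability 1/k. The sketch below computes that expectation; the input format and the function name are hypothetical, and this is an interpretation of the procedure rather than a description of the actual experiment.

```python
def expected_fine_precision(predictions):
    """predictions: list of (coarse_correct, n_fine_senses_in_class).

    If the coarse class is right and contains k fine-grained senses of the
    word, a uniform random pick is right with probability 1/k.
    """
    if not predictions:
        return 0.0
    total = sum(1.0 / k for ok, k in predictions if ok)
    return total / len(predictions)

# Four toy test occurrences: three with the right coarse class, one wrong
score = expected_fine_precision([(True, 1), (True, 2), (False, 3), (True, 4)])
```

With these toy values the estimate is (1 + 1/2 + 1/4) / 4 = 0.4375.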

4.5 Algorithms Based on Ontologies and Corpora 

Three different approaches have been tried in this section: 

- Selectional preferences (7). We tested a formalization that learns selectional
preferences for classes of verbs [18] on subject and object relations extracted from Semcor.
In this particular case, we used the sense tags in Semcor, partly to compensate for the 
lack of data. The results are available for both test settings. Note that the coverage 
of this algorithm is rather low, due to the fact that only 33 % of the nouns in the test 
sets were subjects or objects of a verb. 

- Learning topic signatures from the web (4b, 4c). The information given in
WordNet for each concept is used to feed a search engine and to retrieve a set of training
examples for each word sense. These examples are used to induce a set of words 
that are related to a given word sense in contrast with the other senses of the target 
word [19]. Topic signatures are constructed using the most salient words as given 
by the Chi^2 measure. Table 1 shows the results for the 8-noun setting, which are
slightly above the baseline. 

- Inducing a training corpus from the web (3, 4b, 4c, 4d, 5). In a similar approach, 
the training examples retrieved from the web are directly used to train decision lists, 
which were tested on the 8-noun set [14]. The results are very low on average, but
the variance is very high, with one word failing for all test samples, and others doing 
just fine. 
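
The construction of topic signatures can be illustrated with a small sketch. It assumes one simple variant of the chi-square score, in which each (sense, word) cell contributes (observed - expected)^2 / expected and only words more frequent than expected for the sense are kept; this variant, the function name and the toy token lists are assumptions of the example and may differ from the procedure in [19].

```python
from collections import Counter

def topic_signatures(docs_by_sense, top_n=3):
    """Rank the words of each sense by their chi-square contribution.

    docs_by_sense maps a sense label to a list of tokens collected for it.
    For each (sense, word) cell the score is (observed - expected)^2 / expected,
    kept only when the word is more frequent than expected for that sense.
    """
    freq = {s: Counter(tokens) for s, tokens in docs_by_sense.items()}
    sense_total = {s: sum(c.values()) for s, c in freq.items()}
    word_total = Counter()
    for c in freq.values():
        word_total.update(c)
    grand_total = sum(sense_total.values())

    signatures = {}
    for s, c in freq.items():
        scored = []
        for w, observed in c.items():
            expected = sense_total[s] * word_total[w] / grand_total
            if observed > expected:
                scored.append(((observed - expected) ** 2 / expected, w))
        scored.sort(reverse=True)
        signatures[s] = [w for _score, w in scored[:top_n]]
    return signatures

# Toy token lists standing in for text retrieved per sense (hypothetical)
docs = {
    "bat#animal": "bat cave wings night bat fly cave".split(),
    "bat#sport": "bat baseball ball pitcher bat baseball game".split(),
}
sigs = topic_signatures(docs, top_n=1)
```

The word "bat" is equally frequent for both senses and is therefore not salient for either; the top words are "cave" and "baseball" respectively.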



5 Discussion and Future Directions 

From the comparison of the results, it is clear that algorithms based on hand-tagged 
corpora provide the best results. This is true for all features (local, syntactic cues, 
argument-head relations, global), including the combination of hand-tagged corpora 
with taxonomical knowledge (selectional restrictions). Other resources provide more 
modest results: conceptual density on ontologies, definition overlap on MRDs, or the 
combination of MRD and corpora. The combinations of corpora and ontologies that try 
to acquire training data automatically are promising, but current results are poor. 

If the results are analyzed from the perspective of knowledge types, we can observe 
the following: 
a) Collocations (3) are strong indicators if learned from hand-tagged corpora.

b) Taxonomical information is very weak (4a). 

c) Semantic word associations around situation (4b) and topic (4c) are powerful when
learned from hand-tagged corpora (but difficult to separate one from the other). 
Associations learned from MRDs can also be useful. 

d) Syntactic cues (5) are reliable when learned from hand-tagged corpora. 

e) The same applies to selectional preferences (7), but in this case the applicability
is quite low. It is a matter of current experimentation to check whether the results are
maintained when learning from raw corpora. 

f) MFS (9) is also a strong indicator that depends on hand-tagged data. 

g) POS (1), morphology (2), semantic roles (6), domain (8) and pragmatics (10)
knowledge types have been left aside.

The results seem to confirm McRoy’s observation that collocations and semantic word 
associations are the most important knowledge types for WSD [5], but we have noticed 
that syntactic cues are equally reliable. Moreover, the low applicability of selectional 
restrictions was already noted by McRoy [5]. 

All in all, hand-tagged corpora seem to be the best source for the automatic
acquisition of all knowledge types considered, that is, collocations (3), semantic associations
(situation 4b, topic 4c and argument-head relation 4d), syntactic cues (5), selectional 
restrictions (7) and MFS (9). Only taxonomic knowledge (4a) is taken from ontologies. 
In some cases it is difficult to interpret the meaning of the features extracted from
corpora, e.g. whether a local feature reflects a collocation or not, or whether a global feature
captures a topical association or not. This paper shows some steps to classify the features 
according to the knowledge type they represent. 

However strong, algorithms based on hand-tagged corpora seem to have reached their
maximum point, far from 90 % precision. These algorithms depend on the availability of hand-tagged
corpora, and the effort to hand-tag the occurrences of all polysemous words can be a 
very expensive task, maybe comparable to the effort needed to build a comprehensive 
LKB. Semcor is a small sized corpus (around 250,000 words), and provides a limited 
amount of training data for a wide range of words. This could be the reason for the low 
performance (69 % precision) when tested on the polysemous nouns in the 4 Semcor 
files. Unfortunately, training on more examples does not always raise precision much: 
in experiments on the same sense inventory but using more data, the performance rose
from 73 % precision for a subset of 5 nouns to only 75 %. In the first SENSEVAL [1]
the best systems were just below 80 % precision.

We think that future research directions should resort to all available information 
sources, extending the set of features to more informed features. Organizing the infor- 
mation sources around knowledge types would allow for more powerful combinations. 
Another promising area is that of using bootstrapping methods to alleviate or entirely 
eliminate the need for hand-tagging corpora. Having a large number of knowledge types
at hand can be the key to success in this process. 

The work presented in this paper could be extended, especially to cover more
information sources and algorithms. Besides, the experimental setting should include other
POS apart from nouns, and all algorithms should be tested on both experimental
settings. Finally, it could be interesting to do a similar study that focuses on the different
algorithms using principled knowledge and/or information sources. 
6 Conclusions

We have presented a first attempt to systematize the relation between principled knowl- 
edge types and actual information sources. The former provide guidelines to construct 
LKBs for theoretically motivated WSD systems. The latter refer to robust WSD systems, 
which usually make use of the resources at hand: MRDs, light-weight ontologies or
corpora. In addition, the performance of a wide variety of knowledge types and algorithms
on a common test set has been compared. 

This study can help to understand which knowledge types are useful for WSD, and 
why some WSD systems perform better than others. We hope that in the near future, 
research will shift from systems based on information sources to systems based on 
knowledge sources. We also try to shed some light on the possibilities for semi-automatic 
enrichment of LKBs with the desired knowledge from existing resources. 
Abstract. The Prague Dependency Treebank (PDT) project is conceived of as
a many-layered scenario, from the point of view of the stratal annotation
scheme, from the division-of-labor point of view, and with regard to the level of
detail captured at the highest, tectogrammatical layer. The following aspects of the
present status of the PDT are discussed in detail: the now-available PDT version
1.0, annotated manually at the morphemic and analytic layers, including the recent
experience with post-annotation checking; the ongoing effort of tectogrammatical
layer annotation, with specific attention to the so-called model collection; and
two different areas of exploitation of the PDT, for linguistic research purposes
and for information retrieval application purposes.



1 Introduction 

It is our conviction that the existence of a large corpus together with a rich annotation
scheme applied to it offers a quite new level of possible topics for investigation, using the 
annotated data themselves or data gained by automatic tagging procedures developed 
on their basis. 

Therefore, in the build-up of the Prague Dependency Treebank (PDT), for which 
we use a subcollection of texts from the Czech National Corpus ([4]) we have tried to 
develop a scenario that would be as multi-aspectual as possible. This is reflected first of 
all in the overall annotation scheme conceived of as a three-layer scenario comprising 
tags from the morphemic, analytic and underlying-syntactic layer (for a description of
the annotation scheme of PDT, see e.g. [5], [8], [6], [10], [11], and the two manuals for
annotation published as Technical Reports ([7], [13])).

The annotation on the highest, underlying-syntactic layer, the result of which are the 
so-called tectogrammatical tree structures (TGTS) is based on the original theoretical 
framework of Functional Generative Description as proposed by Petr Sgall in the late 
sixties and developed since then by the members of his research team ([20]). It goes 
without saying that such a rich description of the morphemic and syntactic properties 
of sentences (including some basic coreferential relations) cannot be achieved without 
a thorough and detailed inspection of the language corpus itself (before we can attempt 
to work on automatic tagging procedures, be they stochastic or based on hand-written
rules). Therefore, we have annotated each and every sentence in the PDT manually, 
with the help of several productivity software tools developed during the project (which, 
however, did include some automatic preprocessing modules). 

In the present contribution we would like to discuss in some detail the present status 
of the development of the PDT and the experience we have hitherto gained. In Sect. 2, 
we describe the just finished annotation of the morphemic and analytic layers of PDT 
(which we call PDT version 1.0), discussing at some length our experience with the
post-annotation checking; our first experience with the annotation on the tectogrammatical
layer is summarized in Sect. 3. Sect. 4 then presents some examples of possibilities the 
tagged corpus offers for researchers in both linguistics and natural language processing 
applications (specifically, in information retrieval). We conclude with some observations 
concerning the outlook of the PDT. 

2 The PDT, Version 1.0 

2.1 PDT 1.0 Overview 

Since the process of manual annotation of tens of thousands of sentences is a lengthy 
one, and since we want to make our results available to the community promptly, we 
have decided not to wait until all three layers are annotated, but to release the annotation 
on the first two layers (morphemic and analytic) right after it is finished. We call the 
result the PDT, version 1.0 ([17])*. The annotation consists of a unique (lemma, tag) 
pair at the morphemic layer and a unique (head pointer, analytical function) pair at the 
analytic layer assigned to each token (word form, number, or punctuation occurrence) 
in the corpus. 
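As an illustration of the two pairs just described, a token record could be sketched as follows. The class and field names (`Token`, `head`, `afun`), the tag strings and the sample sentence are assumptions made for this sketch only; they reproduce neither the csts.dtd markup nor the actual PDT tagset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str    # word form, number, or punctuation as it occurs in the text
    lemma: str   # morphemic layer: lemma
    tag: str     # morphemic layer: morphological tag (placeholder values here)
    head: int    # analytic layer: 1-based index of the governing token, 0 = root
    afun: str    # analytic layer: analytical function (e.g. Pred, Sb, Obj)


def children(sentence: List[Token], head_index: int) -> List[Token]:
    """Return the tokens whose head pointer is head_index."""
    return [t for t in sentence if t.head == head_index]


# Invented sample sentence with made-up tag strings
sentence = [
    Token("Podnik", "podnik", "NOUN-nom", 2, "Sb"),
    Token("zvýšil", "zvýšit", "VERB-past", 0, "Pred"),
    Token("výrobu", "výroba", "NOUN-acc", 2, "Obj"),
    Token(".", ".", "PUNCT", 0, "AuxK"),
]

dependents_of_verb = [t.form for t in children(sentence, 2)]
```

Storing the head as an index keeps exactly one (head pointer, analytical function) pair per token, so the dependency tree can be recovered from the flat token list.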

The PDT 1.0’s data layout facilitates experiments based on various methods (pri- 
marily statistical, but not only those) and especially allows for their fair comparison. 
The data on each layer is thus already divided into a training set and two sets of evalu- 
ation data. The morphological tagging as part of the analytic layer annotation has been 
provided by two different statistical taggers ([8], [9]). 

About 1.8 million tokens have been annotated on the morphological layer, and 1.5 
million tokens (almost 100,000 sentences) on the analytic layer. The data itself is marked 
using a common SGML DTD (csts.dtd), and a different format is used for some 
viewing and editing by legacy tools provided for PDT users. 

The organization of such an annotation effort is not an easy task. A total of 32 people 
have contributed to this project to date, with as many as 20 working simultaneously at 
peak times. 

2.2 Post-annotation Checking 

In a manually annotated corpus, the single most important issue is consistency. It is thus 
natural to devote a large proportion of time (resources) to corpus checking, in addition 
to the annotation proper. 

* The “version 0.5” (ca. 1/4 of the data annotated on the two layers) has been available since 
1998 on our website and has attracted 90 researchers from 17 countries. 
There was no annotation manual on the morphemic layer (the categories in the 
tagset used correspond directly to what every high school graduate knows about Czech 
morphology), but the data was double-tagged as usual. While nine annotators have been 
involved in the first pass of the annotation, only two annotators have participated in the 
checking step. The discrepancies coming from the two annotated versions of the same 
file were checked and disambiguated by only one annotator. After that, each file has 
been checked against the automatic morphological analyser (AMA), which produces 
a set of (lemma, tag) pairs for each word form. If the manually assigned pair is not 
found in this set, we have a (relatively easily solvable) consistency problem: either the 
AMA is wrong or incomplete, or the manual annotation has to be corrected by manual 
inspection. The annotation problems found here are, however, not just plain annotation 
errors; misspellings in the original text and formatting errors (such as wrongly split or 
joined word forms) are discovered and corrected, keeping the original input (properly 
marked) for further exploitation, such as spelling error analysis. The AMA check had 
to be done several times, since the dictionary of the AMA has changed several times 
during the time frame of the project. 
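The check against the AMA described above amounts to a set-membership test per token. The following is a minimal sketch under stated assumptions: the analyser is represented as a plain dictionary from word forms to sets of (lemma, tag) pairs, and the function name `ama_check` and all tag strings are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (lemma, tag)


def ama_check(
    annotated: List[Tuple[str, Pair]],
    analyser: Dict[str, Set[Pair]],
) -> List[int]:
    """Return positions whose manual (lemma, tag) pair the analyser does not offer."""
    flagged = []
    for position, (form, pair) in enumerate(annotated):
        if pair not in analyser.get(form, set()):
            flagged.append(position)
    return flagged


# Toy analyser output; the tag strings are placeholders
analyser = {
    "ceny": {("cena", "NOUN-nom-pl"), ("cena", "NOUN-acc-pl"), ("cena", "NOUN-gen-sg")},
    "stouply": {("stoupnout", "VERB-past-pl")},
}
annotated = [
    ("ceny", ("cena", "NOUN-nom-pl")),
    ("stouply", ("stoupnout", "VERB-pres")),   # pair not offered by the analyser
    ("inflace", ("inflace", "NOUN-gen-sg")),   # form unknown to the analyser
]
flagged = ama_check(annotated, analyser)
```

A flagged position does not say which side is wrong; as noted above, either the analyser is incomplete or the manual annotation needs correction, so each hit still requires manual inspection.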

The situation with the analytic layer was a bit different. The group of annotators^ 
(being all linguists by education) has been writing a common set of Guidelines ([7]) 
during the course of annotation (most of it at the beginning of the project, of course), 
solving new problems on-the-go. Needless to say, the solutions of some of the problems 
affected the already-annotated part of the corpus, leading to re-annotation of certain 
phenomena in it. However careful the annotators have been when doing so, inconsisten- 
cies could not be avoided. Moreover, our limited resources forced us to annotate every 
text by one annotator only, and we have used some automatic processing during the 
annotation process as well: the analytical functions have been preassigned by a small set 
of hand-crafted rules, which operated on the manually created sentence structure, and 
during the later stages of the annotation process, we have also preassigned the sentence 
structure using a statistical parser ([3]) trained on data annotated so far. In both cases, 
the annotators have been instructed to correct both the structure and the analytical func- 
tions to conform to the Guidelines. Once the annotation of all the data at the analytic 
layer had been finished, we applied a list of 51 consistency rules (“tests” regarding 
the linguistic content), created by inspection of the data and by formal specification of 
known problems. The tests were intended to help us locate the most evident mistakes 
that the annotators, authors or programs could have made during the process of annotation. 
Some of those tests (and corrections) could be done almost fully automatically, but 
some of them had to be carried out manually (after automatic preparation and flagging 
of questionable spots). Several additional manual (even though sometimes only partial) 
passes through the data were thus required. Additional tests have been designed and 
carried out to discover technical problems, such as missing or incorrect markup etc. 

Since the PDT 1.0 contains two layers (morphemic and analytic), we could take ad- 
vantage of the relations between the two layers for checking purposes as well, improving 
consistency across the layers at the same time. The sentence context of the analytic layer 



^ A different group than that for the morphemic layer. 
allowed us to discover otherwise hidden annotation errors on the morphemic layer and 
vice versa^. The following is a partial list of the cross-layer checks made: 

1. The verb complements at the analytic layer (namely the nodes annotated by the 
analytical function Obj (object), Sb (subject) and Pred (predicate)) were tested 
against their morphological tags. 

2. Prepositional phrases (PPs) headed by certain prepositions listed in the Guidelines 
have been checked for the analytical function against those lists. The case(s) of 
nouns, pronouns, numerals, and adjectives inside a PP have been checked against 
the possible “valency” of its head preposition. 

3. Agreement in case, gender and number between a predicate and its subject (Sb), as 
well as between an attribute (Atr) and its head was checked. 
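To make the third check concrete, the following sketch flags attributes that disagree with their head in case, gender or number. It is an illustration only: the `Node` fields, the part-of-speech labels and the restriction to adjectival attributes (noun attributes in the genitive do not agree with their head) are assumptions of this sketch, not a description of the actual checking software.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    form: str
    pos: str              # coarse part of speech (placeholder labels)
    afun: str             # analytical function
    head: Optional[int]   # index of the governing node in the list, None for the root
    case: str
    gender: str
    number: str


def attribute_agreement_violations(nodes: List[Node]) -> List[int]:
    """Flag adjectival Atr nodes that differ from their head in case, gender or number."""
    flagged = []
    for i, node in enumerate(nodes):
        if node.afun != "Atr" or node.pos != "ADJ" or node.head is None:
            continue
        head = nodes[node.head]
        if (node.case, node.gender, node.number) != (head.case, head.gender, head.number):
            flagged.append(i)
    return flagged


# "Prudký růst maďarského průmyslu." with one deliberately wrong case value
nodes = [
    Node("Prudký", "ADJ", "Atr", 1, "nom", "M", "sg"),
    Node("růst", "NOUN", "ExD", None, "nom", "M", "sg"),
    Node("maďarského", "ADJ", "Atr", 3, "acc", "M", "sg"),   # should be genitive
    Node("průmyslu", "NOUN", "Atr", 1, "gen", "M", "sg"),
]
violations = attribute_agreement_violations(nodes)
```

The flagged index only marks a suspicious spot; whether the morphological tag or the analytic structure is at fault has to be decided by an annotator.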

The resources needed for the annotation on the morphemic and analytic layers can 
be roughly estimated as follows (in percent of the total manpower): 

1. the “raw” morphemic layer annotation: 25% 

2. the “raw” analytic layer annotation: 15% 

3. post-annotation checking (both layers), related software development: 20% 

4. data processing (layer merging) and associated manual corrections: 5% 

5. documentation (incl. analytic-layer Guidelines): 5% 

6. annotation software tool development: 25% 

7. supervision and administration (both layers): 5% 

The total manpower does not include the development of the morphological analyzer 
and the morphological dictionary of Czech, the Czech taggers, the Czech version of the 
analytic (syntactic) parser, nor the effort needed for the initial collection and basic markup 
of the texts used. 



3 Towards PDT 2.0: The Tectogrammatical Layer 

The tectogrammatical annotation ([10], [11], [13]) of the PDT is carried out on two 
sub-levels, resulting in (i) a ‘large’ collection which captures the underlying syntactic 
structure of the sentence (in the shape of dependency tree structures distinguishing about 
40 types of syntactic relations called functors) and the basic features of the topic-focus 
articulation (TFA) in terms of three values of a TFA attribute and of the underlying order 
of sister nodes of each elementary subtree of the dependency tree, and (ii) a ‘model’ col- 
lection with more subtle distinctions of valency relations (in terms of a subcategorization 
of the valency slots by means of the so-called syntactic grammatemes primarily cap- 
turing the meanings of prepositions) and with values that indicate the basic coreference 
links between nodes (within the same sentence but also across sentences). 

In the present contribution we illustrate the complexity of the task of the annotation 
of the ‘model’ collection on some of the issues concerning the restoration of nodes for 

^ Of course, manual correction had to be done after the suspicious annotations have been flagged 
by the checking software. 
semantically obligatory complementations (valency slots) of verbs and of postverbal 
nouns and adjectives and those concerning coreferential relations of the restored nodes 
to their antecedents. 

At the present stage, the annotators have restored semantically obligatory comple- 
mentations as dependents of the given verbs, postverbal nouns or adjectives according 
to the following instructions: 

(i) restore a node with the lemma Gen in case a General Participant is concerned: e.g. 
in the TGTS for the sentence Náš chlapec už čte. [Our boy already reads.] a node is 
restored depending on the verb číst [read] with the lexical label Gen and the functor 
PAT; the attribute Coref for coreference is left untouched; 

(ii) restore a node with the lemma Cor in case of grammatical coreference (i.e. with 
verbs of control, with relative pronouns and the possessive pronoun svůj): e.g. in 
the TGTS for the sentence Podnik hodlá zvýšit výrobu. [The company intends to 
increase the production] a node is restored depending on the verb zvýšit [increase] 
with the lexical label Cor and the functor ACT; the attribute of coreference gets the 
relevant values; 

(iii) restore a node with the pronominal lemma on in case of textual coreference (i.e. the 
deletion of the respective node in the surface shape of the sentence is conditioned by 
the preceding context rather than by some grammatically determined conditions): 
e.g. in the sequence of sentences Potkal jsi Jirku? Potkal. [Have you met Jirka? 
(I-)Met.] two nodes are restored in the TGTS of the second sentence depending on 
the verb potkat [meet], one with the pronominal lemma já and the functor ACT and 
one with the pronominal lemma on and the functor PAT; with the latter node the 
attribute of Coref is filled in by the lemma Jirka. 

Our experience with the first samples of sentences tagged for the ‘model’ collection 
has shown that for the restoration of obligatory participants with verbs and postver- 
bal nouns and adjectives in cases of textual coreference a new lemma Unsp(ecified) 
has to be introduced in order to capture situations when the restored node refers to the 
‘contents’ of the preceding text rather than to some particular element; the informa- 
tion on the antecedent is vague. On the other hand, the restored lemma differs from 
the lemma Gen introduced for General Participants because, in principle, with Gen no 
antecedent is present. The attribute of Coref with the restored Unsp node has the value NA 
(=non-applicable). We believe that this solution offers a possibility of further linguistic 
inquiries into the issues of coreferential relations because it leaves a trace specifying the 
problematic cases. 
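The choice among the lemmas Gen, Cor, on and Unsp can be summarized as a small decision procedure. The sketch below is a simplification under stated assumptions: the type of deletion and the antecedent (if any) are taken as already decided by the annotator, the names `Deletion` and `restored_node` are hypothetical, and other pronominal lemmas (such as já in the example above) are ignored.

```python
from enum import Enum
from typing import Optional, Tuple


class Deletion(Enum):
    GENERAL = "general participant"
    GRAMMATICAL = "grammatical coreference"
    TEXTUAL = "textual coreference"


def restored_node(kind: Deletion, antecedent: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (lemma, Coref value) for a restored obligatory complementation."""
    if kind is Deletion.GENERAL:
        return ("Gen", None)          # Coref is left untouched
    if kind is Deletion.GRAMMATICAL:
        return ("Cor", antecedent)    # Coref gets the relevant value
    # Textual coreference: a clear antecedent gives the pronominal lemma "on",
    # a vague reference to the preceding text gives Unsp with Coref = NA.
    if antecedent is None:
        return ("Unsp", "NA")
    return ("on", antecedent)
```

For instance, `restored_node(Deletion.TEXTUAL, "Jirka")` corresponds to the Patient restored in the second sentence of example (iii).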

Let us illustrate the above points on some examples from the PDT. We add literal 
English translation for each sentence. 

(1) Prudký růst maďarského průmyslu. 

(2) Maďarská průmyslová výroba se v loňském roce zvýšila o devět procent v porovnání 
s rokem 1993. 

(3) Ve stavebnictví byl zaznamenán dokonce dvacetiprocentní přírůstek. 

(4) Vyplývá to z údajů, které v pátek zveřejnil centrální statistický úřad. 

(5) Spotřební ceny stouply v meziročním srovnání o 18,8 procenta, zatímco v roce 1993 
dosáhla míra inflace 22,5 procenta. 
(1’) A rapid increase of Hungarian industry. 

(2’) Hungarian industrial production increased in the last year by nine percent in com- 
parison with the year 1993. 

(3’) In building industry (there) was recorded even a twenty percent increase. 

(4’) (It) follows from the data which on Friday (were) published (by) the National Census 
Bureau. 

(5’) The prices went up in the yearly comparison by 18.8 percent, while in the year 1993 
the rate of inflation reached 22.5 percent. 

In sentences (1) through (5) there occur several instances of verbs and postverbal 
nouns the complementations of which have to be restored. In (3) and (5) the nodes for 
a General Actor (Gen.ACT) have to be restored as dependents on the verb zaznamenat 
[record] and on the noun srovnání [comparison], respectively. The lexical label Gen 
indicates that no antecedent with these nodes exists, because the Actors can be paraphrased 
as ‘those who recorded’ and ‘those who compared’, respectively. However, the second 
obligatory complementation, namely the Patient (PAT) in the frame of the postverbal 
noun srovnání [comparison] can be uniquely determined: it is clear from the context that 
a comparison of prices is concerned. The restored node for the Patient gets the pronominal 
lemma on and the attribute Coref gets the value cena [price]. The restored node 
for the Actor dependent on the postverbal noun přírůstek [increase] in (3) has a similar 
character. It is clear from the preceding sentence (2) that the Hungarian industrial production 
increased, i.e. that an increase of production is concerned. Therefore the restored 
node for Actor depending on přírůstek [increase] gets the lemma on and the attribute 
Coref gets the value výroba [production]. On the other hand, in the TGTS of (4) a node 
for the Patient depending on the verb zveřejnit [publish] should be restored (the valency 
frame of this verb can be paraphrased by ‘someone.ACT publishes something.EFF about 
something.PAT’), but its antecedent is rather vague: we can guess from the context that 
the statistical institute will publish the data on the problems of the increase of the Hungarian 
industry, but the increase may concern the Hungarian industrial production or the 
building industry; the concrete reference is not clear. Thus the restored node gets the 
lemma Unsp and the attribute Coref gets the value NA. 

Let us add two more examples illustrating the restoration of nodes with the lemma 
Unsp: 

(6) Poslanecká sněmovna schválila novelu zákona o mimosoudních rehabilitacích. 

(7) Novela, již nakonec český parlament posvětil, má však tolik zádrhelů, že k jásotu 
není sebemenší důvod. 

(8) Za jednoznačné pozitivum lze považovat snad jen fakt, že zákon vůbec prošel. 

(9) Jeho schválení předcházela úporná jednání uvnitř koalice. 

(6’) The House of Commons approved the amendment to the law on out-of-court 
rehabilitations. 

(7’) However, the amendment, which in the end the Czech Parliament has sanctioned, has so 
many trouble spots that for jubilation (there) is not the slightest reason. 

(8’) As clear positive can be considered perhaps only the fact that the law has been 
approved. 

(9’) Its approval (was) preceded (by) tough negotiations within the coalition. 
In the surface shape of (9) there is no overt Actor with the postverbal noun schválení 
[approval]; the restored Actor will get the lemma Unsp because it is not clear from 
the context whether the restored node refers back to the House of Commons, or the 
Parliament. As for the postverbal noun jednání [negotiations], all the restored dependents 
will get the lemma Unsp: though it is possible to guess that some coalition party (ACT) 
negotiated with some other coalition party (ADDR) about the problems (PAT) of the 
amendment to the law, no coalition parties have been mentioned in the previous context and 
the reference again is only vague. The attribute of Coref with all the three restored nodes 
will get the value NA. 

(10) Páteční rozhodnutí sněmovny, že ... “předjímá svým způsobem další vývoj diskusí 
o církevních restitucích”. 

(11) Zpravodaji LN to včera řekl místopředseda parlamentu Jan Kasal (KDU-ČSL). 

(12) Spolu s precedentním pátečním zásahem do obecního majetku tak podle něho mizí 
velká část dosavadních překážek. 

(10’) Friday decision of the House of Commons that . . . “(it) anticipates in a way the 
further development of the discussions on the church restitutions.” 

(11’) The correspondent of LN (was) told this (by) the vice-chairman of the Parliament 
Jan Kasal (KDU-ČSL). 

(12’) Together with the precedent Friday intervention into the communal property thus 
according to him disappears a great part of the hitherto obstacles. 

In (10) an Actor is restored depending on the noun diskuse [discussion], with the 
lemma Unsp (someone.ACT discusses with somebody.ADDR about something.PAT); it 
is probable from the context that the discussion will be carried out by the political parties 
in the Parliament, but again this cannot be determined univocally. In (12) an Actor should 
be restored under the postverbal noun zásah [intervention], with Unsp as its lexical label 
because it is not clear whether the intervention was made by the House of Commons, or 
whether the speaker has somebody else in mind. 

The decision on the boundary lines among the different types of deletions linked 
to the choice among the “lemmas” of the restored nodes is not an easy task. Our 
strategy again is to mark the difficult cases in a way that allows for their relatively 
simple identification and thus to prepare resources for further linguistic research. 

4 Exploitation of the PDT 

4.1 Linguistic Research 

From the very beginning, the annotation of the PDT has been guided by the effort not 
to lose any important piece of information encoded in the text itself, but at the same 
time not to overload the annotation scheme and thus not to prevent the annotators from 
presenting reliable and more or less uniform results. It is then no wonder that the most 
obvious exploitation of the PDT is for linguistic research as such, which, in its turn, 
offers most important material for the improvement, refinement and clarification of the 
annotation instructions, and, in the long run, for possible modifications of the scheme 
itself. As an example, let us mention the research on the so-called PP-attachment (i.e. the 
ambiguities resulting from the possibility to attach a prepositional group to more than 
one of the preceding elements) carried out by Straňáková-Lopatková ([22]). The author 
formulated an algorithm for the solution of these ambiguities based on an original formal 
framework of deletion automata ([18]) and she used the PDT both as an empirical basis 
for her study and as a testing bed. 



4.2 Annotation Automation 

Another crucial point of an annotation scenario is the division of labor between automatic 
and manual procedures; one would prefer as much automation as possible while not 
compromising the precision of the human annotation. Such considerations have led to the 
development of an experimental system for automatic functor (pre-)assignment which is 
based on a machine-learning approach ([23]). We are examining how to resolve 
ambiguities of functor assignment on the basis of the meanings of prepositions and 
their combinability with nouns of different positions in the EuroWordNet ontology. We 
are also building a valency dictionary containing information relevant on all layers, i.e. 
functors (for TGTS) and morphosyntactic information, on the basis of valency frames 
from various sources, primarily from the material already contained in the PDT. 



4.3 Applications: Information Retrieval 

Searching for information in huge amounts of full texts is still mostly word-based; to 
obtain the relevant (pieces of) documents, a word-based information retrieval (IR) system 
matches word forms from the user’s query with those occurring in the documents. This 
technique is often ineffective because the matching process based on word forms does 
not respect natural language. To avoid this problem one can try and match concepts 
expressed by words rather than words themselves. When we want to work with concepts 
instead of words, the IR system should be able to compare the concepts in order to 
determine (or at least to estimate) their semantic similarity or differences. 
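The weakness of matching on word forms is easy to see for an inflected language such as Czech. The sketch below contrasts form-based matching with lemma-based matching, which is only the first step towards the concept matching discussed here. The small `LEMMAS` table is a hypothetical stand-in for a real lemmatizer, and the function names are assumptions of this sketch.

```python
from typing import Dict, List, Set

# Hypothetical form-to-lemma table; a real system would use a lemmatizer
LEMMAS: Dict[str, str] = {
    "ceny": "cena",
    "cen": "cena",
    "cena": "cena",
    "stouply": "stoupnout",
}


def lemmatize(words: List[str]) -> Set[str]:
    """Map each word to its lemma, falling back to the lower-cased form."""
    return {LEMMAS.get(w.lower(), w.lower()) for w in words}


def matches_by_form(query: List[str], document: List[str]) -> bool:
    """Word-based matching: true only if some identical word form is shared."""
    return bool({w.lower() for w in query} & {w.lower() for w in document})


def matches_by_lemma(query: List[str], document: List[str]) -> bool:
    """Lemma-based matching: true if query and document share a lemma."""
    return bool(lemmatize(query) & lemmatize(document))


document = ["Spotřební", "ceny", "stouply"]
query = ["cena"]

form_hit = matches_by_form(query, document)
lemma_hit = matches_by_lemma(query, document)
```

Here the query form cena never occurs in the document, so form matching misses it, while the shared lemma is found. Comparing concepts rather than lemmas would additionally require the sense disambiguation and semantic classes listed in the next paragraph.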

The basic means which enable or support the construction of the concept models are 
the following: 1) lemmatization and lexical and morphological disambiguation, 2) word sense 
disambiguation, 3) automatic recognition of dependency syntactic structure of sentences, 
4) identification of semantically significant collocations or phrases (i.e. such compositions 
that create meanings which cannot be composed from the meanings of their parts), 
5) identification of semantic classes (i.e. classes of semantically similar senses, including 
synonymy), 6) identification of semantic significance of words (different words carry 
different amounts of information and consequently they differ in capability to characterize 
the semantic content), 7) utilization of various electronic dictionaries (they can, e.g., 
provide lists of word meanings, typical phrases, explanatory ones provide typical examples 
of use or definitions, thesauri include relations among word senses, etc.) including 
8) the EuroWordNet, 9) utilization of a large collection of natural language documents 
and its statistical processing and 10) utilization of morphologically and syntactically 
annotated corpora. These means are not mutually independent; better mastery of one 
of them supports the others. 
Above all we point out that without the statistical processing of empirical data we 
can know neither the frequencies of language phenomena or structures nor the contexts 
in which they appear; both can be obtained by the processing of a large collection of 
texts (cf. [12]). The preparation for retrieving information in a (large) collection of texts 
consists of two processes (the second supposes and makes use of the first): the first is 
the mapping of the whole collection and modeling the language (of these texts) by the 
extraction of stochastic quantities which characterize the language (the system of these 
quantities makes a map or a model of the collection; this model should be adequate for 
our purpose, i.e. it should comprehend the features important for the semantic analysis), 
and the second is the mapping of the content of each document separately; in fact, 
these maps or models are the models of the document concepts (the semantic content of 
a document is considered here as a (very complex) composite concept). 
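The two preparatory processes can be pictured with a deliberately simple stand-in. The text does not name the stochastic quantities that are extracted; the sketch below substitutes inverse document frequency for the collection model and frequency-weighted lemma scores for the document model, purely as a familiar example. All names and the toy lemmatized documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def collection_model(docs: List[List[str]]) -> Dict[str, float]:
    """Step 1: one weight per lemma from the whole collection (inverse document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(lemma for doc in docs for lemma in set(doc))
    return {lemma: math.log(n_docs / df) for lemma, df in doc_freq.items()}


def document_model(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Step 2: weight each lemma of one document by its count and the collection model."""
    counts = Counter(doc)
    return {lemma: count * idf.get(lemma, 0.0) for lemma, count in counts.items()}


# Toy collection of already lemmatized documents
docs = [
    ["cena", "stoupnout", "inflace"],
    ["výroba", "stoupnout", "průmysl"],
    ["zákon", "novela", "parlament"],
]
idf = collection_model(docs)
model = document_model(docs[0], idf)
```

The second step depends on the first, as stated above: a lemma that is frequent across the collection (stoupnout) characterizes the first document less than one that is specific to it (cena).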

Both the preparatory processes are statistically based and both make use of the annotated 
corpus. The human understanding of the text is latently included in the annotated 
corpus and consequently the corpus implicitly contains the knowledge of the structure 
of natural language. This knowledge can be utilized in the process of text analysis us- 
ing suitable machine-learning algorithms. In contrast to traditional linguistic methods, 
the procedures based on corpora and utilizing powerful computers enable us to process 
large, statistically significant patterns of empirical natural language data. In the process 
of machine learning, when the computer learns the structure of natural language, the 
knowledge of stochastic distribution of various phenomena in the language extracted 
from a corpus is one of the crucial sources of information necessary for the learning. 
The statistical view on raw data is radically different from the view on linguistically an- 
alyzed data. If we work with texts morphologically and syntactically labeled, we obtain 
a much more precise picture of them by statistical processing. From this point of view, 
it becomes evident that an annotated corpus is of fundamental importance. 



5 Conclusion 

We have tried to demonstrate on some selected issues how complex the task of syntactic 
annotation of a corpus is and what solutions have been chosen to make it usable. The 
future efforts will be concentrated on four domains: (i) to reach a solid volume of anno- 
tated data on the third layer, (ii) to extend the scenario to make it possible to design some 
kind of a formal semantic (logical) representation (the “fourth” layer), and eventually to 
annotate such a layer of the PDT, (iii) to use the PDT in the domains of information retrieval, 
information extraction and/or computerized translation, for the purpose of which 
extensive work with parallel (or at least comparable) corpora is a necessary precondition, 
and (iv) to prepare grounds for a similarly systematic compilation and annotation 
of a spoken language (speech) corpus. Ars longa, vita brevis. 
Abstract. Against the background of the growing need for information, which for 
language used to be supplied in a rather limited way, the new solution found 
in language corpora and the way this has been implemented are outlined and 
discussed. For the Czech language, this solution has materialized in the 100-million-word 
representative Czech National Corpus (CNC, 2000). In the following, a brief tour is 
offered through various stages of its build-up, characterizing various corpora 
within CNC and giving some figures about proportions of various types of language 
represented. The last part of the contribution sets a minimal programme for further 
research and desiderata to be followed in general in this important and 
international stream of modern science. 



1 Language Corpora: Information and Use 

Linguistics has always suffered from a disease one might call data insufficiency, although 
linguists have only rarely admitted that this was the case. By the data, language data 
are meant here, of course, a necessary and usual precondition to any information and 
conclusions the linguist is likely to draw, just like in any other science. This has always 
been the mainstream in linguistics and Chomsky’s perennial contempt for data hardly 
invalidates this general data necessity. To be true to the past, however, linguists have not 
always been aware that they lack more data and reliable information, a fact which is 
being gradually revealed only now, with new and better outputs based on and supported 
by better data. It used to be prohibitively expensive and time-consuming to collect large 
amounts of data manually, an experience familiar to anyone who has worked with citation 
slips from lexical and other archives trying to compile a dictionary. To create such an 
archive of some 10-15 million slips took many decades and many hands. Thus, there 
seemed to be a natural quantitative limit which was difficult to reach and which was 
almost impossible to cross. The amount of information which could be gleaned from 
such a limited archive was used for compilation of all dictionaries of the past, grammars 
and other reference books, including school textbooks we are still using today. 

With the advent of computers and modern very large corpora, all of this has been, 
rather suddenly, changed. For the first time in history, the linguist has at his or her disposal 
multiples of previous amounts of data, their flood being, in fact, so overwhelming that he or she 
still has not got used to it and feels like a drowning person. It has become obvious now that 
the information to be found in this kind of data is both vastly better and more balanced than 
anything before and that the quality of information is proportionate to the amount of 
data amassed. Not surprisingly, this information has to be drawn from contexts, where 
it is coded in a variety of ways, and texts and ways have to be found how to get at the 
information needed. In turn, however, this casts a shadow over the quality and reliability 
of our present-day dictionaries and grammars making them problematic and dated. 

With modern corpora in existence and use, it was easy to see that these might be useful 
for many other professional and academic disciplines and quarters of life, including 
general public and schools. After all, ours is the Information Society, as it has been termed 
recently, and it is obvious that there is a growing need for information everywhere. As 
there is, practically, no sector of life and human activity, no profession or pastime, where 
information is not communicated through and by the language, the conclusion seems 
inevitable: the information needed is to be found in language corpora. Should one fail 
in finding in corpora what he or she may need, then these corpora are either still too 
small, though they may have hundreds of millions of words already, or too one-sided and 
lacking in that particular type of language, as this happens to be the case of the spoken 
language. It is evident that there is no alternative to corpora as the supreme information 
source and that their usefulness will further grow. 

2 Czech Solution: The Czech National Corpus 

2.1 The Situation in the Early 1990s and the Project of CNC 

Just like anywhere else, the best kind of linguistic research in the Czech language could 
not but be based on a data archive in the past. There has been the old Academy’s tradition of 
manually collecting language data on citation slips which, over some eight decades, 
grew into an archive of some 12-15 million excerpts. These were drastically 
cut down in the sixties when it was felt that enough data had been accumulated for a new 
dictionary of Czech. Since then, most of the work done has been based on this lexical archive, 
including the compilation of a new large dictionary of contemporary literary Czech 
(Slovník spisovného jazyka českého) in four volumes which came out in 1960-1971. 
However, no extensive and systematic coverage of the language has been undertaken 
since. 

When in the early nineties vague plans were made for a new dictionary of the 
Czech language, which would reflect all the turmoil and social changes taking place, 
the kind of data needed for this was found to be non-existent. At the same time, 
it was evident that the old manual citation slip tradition could not be resumed and one 
was facing a considerable data gap of over 30 years. My suggestion early in 1991-2 was 
that a computer corpus be built from scratch at the Academy of Sciences to be used for a 
new dictionary and for whatever it might be necessary to use it for, a move not exactly 
applauded by some influential people at the Academy. 

Yet, times have changed and it was no longer official state-run institutions but real 
people who felt they must decide this and act upon their decision and determination. 
Thus, a solution was found which took the shape of a new Department of the Czech National 
Corpus, established in 1994 at Charles University (or rather its 
Faculty of Philosophy), thus also introducing a base for a branch of corpus linguistics 
(Čermák 1995, 1997, 1998). This solution was supported by a number of open-minded 
linguists who did feel this need, too. After the foundation of the Institute of 
the Czech National Corpus, all of these people continued to cooperate, subsequently as 
representatives of their respective institutions. They now form an impressive cooperating 
body of people from five faculties of three universities and two institutes of the Academy 
of Sciences. Once support had gradually been gained, in various forms, from the State Grant 
Agency, Ministry of Education and from a publisher, people were found and trained, and 
the Czech National Corpus project (CNC), an academic and non-commercial one, 
could be launched. In the year 2000, the first 100-million-word corpus, called 
SYN2000, went public and was offered for general use (Čermák 1997, 1998, Český 
národní korpus 2000). 

The general framework of the project is quite broad as, in fact, more than one corpus 
is planned and built at the same time. Briefly, its aim is to cover as much of the Czech 
language as possible, in as many forms as are accessible. The overall design 
of the Czech National Corpus consists of many parts, the first major division following 
the III synchrony-diachrony distinction, where the orientation point in time is, roughly, 
the year 1990. Both major branches are each split into (1) written, (2) spoken and 
(3) dialectal types of corpora, though this partition cannot be upheld for the spoken 
language in the diachronic corpora. Yet this is only the tip of the iceberg, so to 
speak, as it is preceded by the much larger storage and preparatory forms our data take 
on first, namely the I Archive and II Bank of CNC. 

In what follows, I will briefly outline each form and stage the data go through before 
reaching their final stage and assuming a form which may be exploited. The first text 
format, in fact a variety of them, which one gets from providers or which is scanned into 
the computer, is stored in the I Archive of CNC. Of course, there is also the laborious 
zero stage (0) of getting texts, mainly from the providers, which is not as easy 
and smooth as one would wish, often depending on the whims of individual providers, 
the legal act of securing their rights and the physical transport of the data finally 
obtained. The Archive is constantly being enlarged and contains, at the moment, some 
400-500 million words in various text forms. All of these texts are gradually converted, 
cleaned, unified and classified and, having been given all this treatment, they flow into 
the II Bank of CNC. Conversion has to be oriented towards the rich variety of formats 
publishers prefer to use and implies, in many cases, that a special conversion programme 
has to be developed for this. Cleaning does not mean any correction of real texts, 
which are sacrosanct and may not be altered in any way. Rather, an effort is made to find 
and extract (1) duplicate texts or large sections of them which, surprisingly and for a 
number of reasons, are found quite often. Then, (2) foreign language paragraphs are 
identified and removed, these being due to large advertisements, articles published in the 
Slovak language etc. Finally, (3) most of the non-textual parts of texts, such as numerical 
tables, long lists of figures or pictures, are taken out, too. Thus treated, each text is 
then given the SGML format with a DTD (document type definition) containing explicit and 
detailed information about the kind of text, its origin, classification etc., including 
information about which member of the CNC staff is responsible for each particular stage 
of the process. 
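
The cleaning step described above can be illustrated by a minimal sketch. It is not the CNC conversion software, which is not described in detail here; the function names, the hashing approach and the digit-share threshold are all assumptions made for illustration. It covers only steps (1) and (3), exact repeated paragraphs and mostly numerical material, and leaves the kept text unaltered, as the paper requires. Foreign-language detection (step 2) is left out.

```python
from __future__ import annotations

import hashlib
import re


def normalise(paragraph: str) -> str:
    """Lower-case and collapse whitespace so trivial layout differences are ignored."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def paragraph_key(paragraph: str) -> str:
    """Stable fingerprint of a normalised paragraph."""
    return hashlib.sha1(normalise(paragraph).encode("utf-8")).hexdigest()


def looks_non_textual(paragraph: str, max_digit_share: float = 0.5) -> bool:
    """Hypothetical heuristic: flag paragraphs made up mostly of digits."""
    chars = [c for c in paragraph if not c.isspace()]
    if not chars:
        return True
    digits = sum(c.isdigit() for c in chars)
    return digits / len(chars) > max_digit_share


def clean(texts: dict[str, list[str]]) -> dict[str, list[str]]:
    """Leave out repeated and non-textual paragraphs; kept paragraphs are not modified."""
    seen: set[str] = set()
    cleaned: dict[str, list[str]] = {}
    for text_id, paragraphs in texts.items():
        kept = []
        for paragraph in paragraphs:
            key = paragraph_key(paragraph)
            if key in seen or looks_non_textual(paragraph):
                continue
            seen.add(key)
            kept.append(paragraph)
        cleaned[text_id] = kept
    return cleaned


sample = {
    "a": ["Dobrý den.", "12 34 56 78"],
    "b": ["dobrý  den.", "Nový odstavec."],
}
result = clean(sample)
```

Hashing whole paragraphs only catches verbatim repeats; finding "large sections" shared between otherwise different texts would need a shingling or similar near-duplicate technique, which this sketch does not attempt.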

It is obvious that, to be able to do this and achieve the final text stage and shape in 
the Bank of CNC, one has to have a master plan showing what types of texts 
should be collected and in what proportions. While more will be said about this later, it is 
necessary to mention now that this plan has been implemented and recorded in a special 
database, the records of which are mirrored in the corpus itself. 

At the moment, CNC is served by a comprehensive retrieval system called gcqp, 
which is based on the Stuttgart cqp programme. This has been considerably expanded 
(mostly by P. Rychly) and given a sophisticated graphic interface supported by Windows, 
although it is still UNIX-based, of course. It now offers a rich variety of search functions 
and facilities. 

2.2 Corpora within CNC. SYN2000 

There is no clear-cut boundary to be found in language telling us where its history stops 
and the present starts or, to put it linguistically, where to draw the dividing line between 
diachrony and synchrony. Thus, the time limits used for this in the case of CNC must of 
necessity be somewhat arbitrary and are likely to be moved forward with the progress of 
time. It was very easy to draw the line in the domain of informative texts (newspapers and 
magazines), where the political turnabout of 1989/1990 brought a substantial change in 
the type of language used: while the old communist jargon and platitudes were suddenly 
gone, a new stress was laid on urgent and new topics. Hence, the year 1990 has also 
been accepted as the general starting point for most imaginative texts, i.e. fiction and 
poetry. There is, however, a notable exception here, forming a bridge to the past. 
Since, obviously, classics are constantly re-edited and re-read, a provision was made 
to include these if found to belong to this category. Thus, a selection both of books 
published after 1945, i.e. the end of World War II, and of authors born after 1880 
was made to complement the fresh and new books first published after 1990, 
which are, however, in a clear majority. Technical texts, i.e. those belonging to various 
specialized branches of knowledge, also take the year 1990 as their starting point in time 
and criterion for inclusion. With the notable and well-argued, but rather small, exception 
of the literature described above, this corpus runs up to 1999; it was finished a 
year later and released as a 100-million-word corpus of written contemporary Czech under 
the name SYN2000, which stands for synchronic and the year of its completion. In 
comparison to the British National Corpus, which spans a much larger interval, i.e. 
1974-1994, it is comfortably limited to a very narrow time span. This, in turn, will easily 
allow for a further continuation, now in preparation and soon to follow, covering 
the most recent years. 

The synchronic spoken language, covered by a corpus called the Prague Spoken Corpus 
(ORAL-PMK), has just been released, too, to complement the written SYN2000. This 
is a small corpus of almost 800 000 words of authentic spoken language recorded in 
Prague, its size reflecting the manpower and financial limits one had to work within. In 
fact, obtaining this kind of corpus is a much more costly and time-consuming activity. This 
is also why spoken corpora are so scarce everywhere, and if a language 
has one, it is invariably very small (the exception being the costly ten-percent spoken 
component of the BNC). There is no doubt whatsoever that spoken corpora are 
more valuable for a linguist and that future development will have to switch over to these 
as primary sources of language and its change. 

The language covered in the Prague Spoken Corpus is made up of some 300 tape-recorded 
conversations with a representative variety of speakers. The sociolinguistic 
variables observed include a balanced selection of (1) sex (women and men in equal 
proportions), (2) age (younger and older, i.e. two categories, one over 20 but under 
35 years, and the other including people older than 35 years) and (3) education (lower 
and higher, i.e. basic as against higher secondary or university education). These three 
variables were calculated and balanced to form a grid which was projected over the 
fourth one, (4) the type of recording. This, too, was split into two subtypes, 
one made up of a series of answers to very broad and comprehensive questions (called 
the formal type) and the other of free conversations between two subjects who 
knew each other well (called the informal type). Thus, a final grid made up 
of 8 basic variable values, two for each of the four variables (giving an impressive 
number of combinations, all of which had to be represented), was obtained and recorded. 
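
The size of this grid can be made concrete with a small sketch. It assumes that the "8" refers to the two values of each of the four variables, which is an interpretation rather than something the text states outright; the value labels are illustrative only. Under that reading, the number of speaker/recording cells to be filled is the product of the value counts.

```python
from itertools import product

# Four binary variables as described in the text; the value labels are illustrative.
VARIABLES = {
    "sex": ("female", "male"),
    "age": ("20-35", "over 35"),
    "education": ("lower", "higher"),
    "recording": ("formal", "informal"),
}


def sampling_grid(variables=VARIABLES):
    """Enumerate every combination of variable values as a dict per grid cell."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


grid = sampling_grid()
number_of_values = sum(len(values) for values in VARIABLES.values())
print(number_of_values, len(grid))
```

With two values per variable this gives 2 x 2 x 2 x 2 = 16 cells, so the roughly 300 recordings would average just under 19 per cell if they were spread evenly, which the text does not claim.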

An extensive tagging of this corpus has just been finished and its results will, 
hopefully, contribute to restoring the balance between the official, codified and somewhat 
stiff written standard Czech (spisovná čeština) and the unofficial, non-codified and very 
widely used spoken standard of Czech (obecná čeština), which has repeatedly been denied 
existence and representation in dictionaries, grammars etc. To illustrate the importance of 
this, let me only note that some linguists are on the verge of seeing a diglossia here. It 
is to be hoped that other spoken corpora will soon be made available, too, specifically 
from the regions of Brno and Plzeň. 

To finish this part, two remarks are of some relevance. It is obvious that we have settled 
for nothing less than the authentic spoken language, discarding all sorts of intermediary 
forms, such as radio lectures or public addresses, as not being purely and prototypically 
spoken. Dialectal corpora now exist in germ form only, their greatest problem being 
the scarcity of contemporary data. 

The diachronic or historical dimension of language is a somewhat different task, since 
texts have to be scanned into the computer, and it has to be viewed in a long-lasting and 
slowly moving perspective. In general, the Diachronic Corpus of CNC (DIAKORP) 
aims to cover all of the past of the Czech language up to the point where the synchronic 
corpus takes over. This ambitious task would, when finished, offer a continuous stream of 
data from all periods recorded, thus allowing for a higher standard of language study, a 
better insight into general tendencies in its development etc. It was decided to cover the 
very meagre beginnings of the written language records in their entirety, i.e. roughly 
from 1250 onwards until a point is reached when, due to a profusion of texts available, 
selection criteria have to be applied. In general, and in contrast to the synchronic corpus, 
texts are mostly represented here in samples. 

Some of the problems to be solved are different here. Most are related or due to variant 
spelling (hence all texts prior to 1849 to be included in the corpus are transcribed, while 
their original spelling is kept and can be found in the Bank, of course). The current size 
of this diachronic corpus, which is growing all the time, is about 2 million words. Obviously, 
provisions are made for historical dialectal forms, too, if found and acquired in electronic 
form. These would, then, form the diachronic counterpart to the synchronic dialectal 
corpus. 

2.3 Domain Distribution in CNC 

Most corpora have never really attempted any degree of real criteria-based 
representativeness, and vague proclamations that a corpus should be based on a balance of 
language production and reception were no help. As it is virtually impossible to find out 
what sort of criteria to use here and in what quantitative proportions, we decided to embark 
upon our own series of research into language reception only, i.e. the way people 
are exposed to language and are thus “receiving” it, so to speak. The first research, 
run by sociologists (Cermak 1997, Cermak-Kralik-Kucera 1997), came up with the first 
surprises, showing, on the one hand, that fiction has been largely overestimated and that 
its real representation and readership is much lower. On the other hand, technical and 
specialized texts were shown to be more widely read than previous popular estimates 
indicated. The figures, then, were 33.5 % for the specialized/technical domain 
and 66.5 % for the nonspecialized domain, the latter being further split into journals 
(56 %), fiction and poetry (10 %) and other types (0.5 %). 

A revision of these findings, based on a number of resources, has since been made, 
modifying and, importantly, specifying the previous figures. In this sense, CNC might now 
be called a representative corpus which has been carefully planned from scratch. 
This stands in sharp contrast to corpora where any available text, preferably from 
newspapers, is accepted, amassed into an amorphous entity and called a corpus. These rather 
spontaneous corpora rely on a seemingly infinite supply of texts and the philosophy of 
great numbers, hoping somehow that even the most specific and specialized information 
might eventually find its way into them. For many reasons, this could not be the Czech 
philosophy. Hence, the figures arrived at by this research, which should of course be 
further scrutinized and revised, are now being used for the fine-grained construction 
and implementation of the synchronic corpus SYN2000. The overall structure is shown 
in Table 1. 



Table 1. The structure of SYN2000 

IMAGINATIVE TEXTS                        15 % 
  Literature                             15 % 
    Poetry                                0.81 % 
    Drama                                 0.21 % 
    Fiction                              11.02 % 
    Other                                 0.36 % 
    Transitional types                    2.6 % 
INFORMATIVE TEXTS                        85 % 
  Journalism                             60 % 
  Technical and Specialized Texts        25 % 
    Arts                                  3.48 % 
    Social Sciences                       3.67 % 
    Law and Security                      0.82 % 
    Natural Sciences                      3.37 % 
    Technology                            4.61 % 
    Economics and Management              2.27 % 
    Belief and Religion                   0.74 % 
    Life Style                            5.55 % 
    Administrative                        0.49 % 

Any comparison with the few available data from elsewhere is difficult, since it 
obviously depends on how the various subcategories are defined. There is no consensus 
as to what might be viewed as Leisure, for example, the term used by the BNC and 
represented there by over 15 %, which must overlap with what is termed Life Style in 
CNC (although its representation there happens to be three times lower). Categories where 
more consensus is to be expected are, however, matched more closely in Czech and English, 
such as Natural Sciences (BNC: 3.63 %, CNC: 3.37 %). Yet the overall 
figures for Imaginative Texts do not agree (BNC: 27.05 %, CNC: 15 %), and it 
seems that there is a strong and inexplicable bias towards these in English. Superficially, 
this might imply that the English may not read newspapers and magazines as much as 
the Czechs do (BNC periodicals: 29.67 %, CNC journalism: 60 %), but there is probably 
a much better explanation to be sought in the compatibility of the categories. 
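
As a check on the figures quoted above, the following sketch encodes the shares of Table 1 and the three BNC/CNC pairs mentioned in this paragraph. The data structure and function names are assumptions for illustration; the numbers are taken from the table and the text, with decimal commas written as points. It verifies that the subcategories add up to their stated totals and expresses each CNC share as a ratio of the corresponding BNC share.

```python
from decimal import Decimal


def pct(value: str) -> Decimal:
    """Parse a percentage exactly, avoiding binary floating-point rounding."""
    return Decimal(value)


SYN2000 = {
    "Imaginative texts": {
        "total": pct("15"),
        "parts": {
            "Poetry": pct("0.81"), "Drama": pct("0.21"), "Fiction": pct("11.02"),
            "Other": pct("0.36"), "Transitional types": pct("2.6"),
        },
    },
    "Journalism": {"total": pct("60"), "parts": {}},
    "Technical and specialized texts": {
        "total": pct("25"),
        "parts": {
            "Arts": pct("3.48"), "Social Sciences": pct("3.67"),
            "Law and Security": pct("0.82"), "Natural Sciences": pct("3.37"),
            "Technology": pct("4.61"), "Economics and Management": pct("2.27"),
            "Belief and Religion": pct("0.74"), "Life Style": pct("5.55"),
            "Administrative": pct("0.49"),
        },
    },
}


def subtotal_matches(entry) -> bool:
    """True if the listed parts add up to the stated total (or no parts are listed)."""
    parts = entry["parts"]
    return not parts or sum(parts.values()) == entry["total"]


def ratio(cnc: str, bnc: str) -> Decimal:
    """CNC share divided by BNC share, rounded to two decimal places."""
    return (pct(cnc) / pct(bnc)).quantize(Decimal("0.01"))


checks = {name: subtotal_matches(entry) for name, entry in SYN2000.items()}
grand_total = sum(entry["total"] for entry in SYN2000.values())
comparison = {
    "Natural Sciences": ratio("3.37", "3.63"),
    "Imaginative texts": ratio("15", "27.05"),
    "Journalism vs. periodicals": ratio("60", "29.67"),
}
```

The ratios make the point of the paragraph numerically: the two corpora nearly agree on Natural Sciences, while the imaginative and journalistic shares differ by roughly a factor of two in opposite directions.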



3 Desiderata and an Outlook 

The study of language, which is now made vastly easier and more reliable by the existence 
of corpora, has always been based on the identification of regularities, rules and relations 
deduced from language data. It now seems that the time for a dramatic change in linguistics 
has come, permitting one at last to study fully the syntagmatic and combinatory aspects of 
language and thus to redress the balance and past practice, which has always been 
bent on paradigmatics, categorizing, classification etc., without really having enough 
data for doing this. However, the obvious part played in this by the researcher’s 
introspection has never been critically questioned, and it is only now that one is able to 
see to what, often unwarranted, degree this has been applied, contrary to the facts one has 
nowadays. The assumptions made by linguists and language teachers about language 
on the basis of really few examples are staggering. Equally problematic now are past 
judgements about what is correct and what is wrong in language, a subject dear to the hearts 
of all prescriptivists. The abyss between wishful thinking, recommended to and in some 
cases even imposed on others, and real language facts is considerable. It is now 
clear that major revisions of old preconceived and unwarranted claims and made-up 
descriptions are due to come. Language reality is shown by corpora to be no longer 
black and white, with fine distinctions and simple truths built on these. 

For short, by linguists all sorts of people dealing with language professionally are 
meant here, including pure linguists, language teachers, lexicographers, artificial 
intelligence people, informaticians and the like. But it is not only linguists who find, or 
might find, today’s and, even more so, tomorrow’s corpora useful. All of our oral culture of 
all denominations and branches, and its history, is mediated in language and stored in 
language corpora. Hence, many other professions will eventually have to resort to corpora 
for information. A necessary and inventive step has to be seen in making them publicly 
accessible, especially to young people in schools, CNC being the first one to have been 
offered to schools on a large scale. Needless to say, Czech students have already found 
it useful for their studies to an increasing degree. There is now a double public access 
offered, both free of charge. The first, called PUBLIC, is limited to a 20-million-word part 
of SYN2000 offering a simple type of search, and is recommended for getting a fast and brief 
insight. The other one, the FULL approach, has to be negotiated with the Czech National 
Corpus Institute at Charles University and, upon signing a written agreement, full access 
is given to anyone with an academic and non-commercial interest. Information 
about both is to be found at the web-pages of CNC 

http://ucnk.ff.cuni.cz 

where other information and the manual for users can also be found and downloaded 
(see also Cesky narodni korpus 2000). 

To compile a corpus is, or rather may be, a problem of finding a set of mutually 
compatible solutions to multiple questions, if they are ever asked, and it happens to be a 
matter of major and costly decisions followed by hard work based on them. Each of these 
may be viewed as a point which has been critically argued and eventually set on a scale 
between rather opposite options, as a more or less definite view of it taken in its 
particular context. Although some of the points tend to be stressed and discussed more and 
some less, it is hard to overlook some new points and aspects which have not been so 
obvious before. 

Thus, it is hard to see why most, in fact almost all, corpora are seen as strictly 
time-limited projects only. When finished and having served their primary goal, they are 
far from being kept up, modernized and substantially enlarged. The only notable 
exception seems to be the recently released BNC World Edition, yet that is, in fact, 
not a continuation but a mere modification due to other reasons. 

This common practice of abandoning the field and goal, however understandable in 
its particular contexts, is difficult to understand from a broader perspective. Admittedly, 
every language needs a consistent, perpetual and next-to-exhaustive coverage in its data, 
and it should have a corpus of corresponding qualities, although, in practice, this may be 
a gradual business of taking many minor decisions in the course of its construction and 
maintenance. This is particularly important in the case of minor languages which, unlike 
English and other languages, cannot afford the luxury of having a variety and multitude 
of corpora, at least not at the moment. Given the ever-growing demand, even hunger, 
for more data and information, it seems reasonable that large corpora should be viewed 
as standing tasks for languages and language planners, and that due and permanent support 
should be found and extended to them. It is equally obvious that corpora should be 
supported by central, preferably state-based, bodies, since the task-oriented and 
short-term grant policy has been shown not to match this formidable task and goal. 

While the original English experience with corpora, both good and less good, may at first 
be a source of much inspiration here, it now becomes a major problem if seen in a different 
context. Clearly, where language-specific matters come up and where there is a felt need 
to make corpora of very many different languages compatible and usable side by 
side, a conflict must evolve which is basically of a typological nature. What is involved 
is not only what is already gradually being recognized as being rather different, i.e. the 
profusion of endings in inflectional and agglutinative languages, some of them explicitly 
marking categories and functions which may be inherent in others, hence the 
enormous incompatibility of tag-sets for various languages. There are other matters 
to be reconsidered as well. A sobering experience for people from the CNC has been 
that of stochastic, hence automatic, approaches to analysis, tagging and lemmatization. 
SYN2000, which has been tagged and lemmatized in this way, the idea having originally been 
adopted from elsewhere, is now being painstakingly corrected by hundreds of rules 
and subroutines based on an analysis of an uncomfortably large and unacceptable number of 
mistakes. Clearly, this is a language-specific method, and an approach which works well 
for a non-inflecting language with a fixed word order performs rather poorly for Czech, 
a highly inflecting language with a free word order. 

Due to the growing scarcity of text resources, often, but not always, caused by 
commercial barriers and copyright obstacles, it may prove necessary in future that 
new resources, such as the Internet, be sought and that the policy of using only samples 
of texts be abandoned. The affluence of data is an illusion which may 
seem, superficially, true for great languages like English or German. It is certainly 
not true for languages with a much smaller number of speakers, where achieving a desired 
size and number of texts for a corpus may prove to be increasingly difficult, if not 
impossible. Obviously, spoken corpora will have to take over in future, despite their 
prohibitive price now. After all, most people still prefer speaking to writing, and there is 
more data to be found right there. 

Hopefully, CNC will go on growing and expanding, and an increasing number of 
users and new uses will be found. Obviously, for future versions of it, or at least 
of some parts of it, links could and should be established to other sources to make it 
a pivotal centre leading one, if necessary, to more specific information elsewhere. Thus, 
acoustic counterparts for the spoken corpora are already feasible, linking 
sound and letter, while, on the other hand, links to pictures, encyclopedias, dictionaries 
and other comprehensive information resources are an obvious possibility, too. 

The texts which have gone through the preparatory stages mentioned briefly above are 
being tagged to form an SGML-conformant text format in which the following 14 levels of 
mark-up are recorded: 

1- type of corpus: synchronic, diachronic, spoken; parallel 

2- type of text: informative, imaginative or a mixture of both 

3- type of genre, which differs between specialized and non-specialized language 
and comprises some 60 different types (e.g. drama, novels, ..., music, philosophy, ..., 
industry, sport, ..., religion etc.) 

4- type of subgenre (such as text-book, criticism, encyclopedic etc.) 

5- type of medium (such as book, newspaper, script, occasional etc.) 

6- text type, i.e. in verse or not 

6- sex of the author if known (including a team) 

7- language (in case of foreign language texts) 

8- original language (in case of translation) 

9- year of publication and the first year of publication 

10- name of the author if known or that of the translator 

11- name of the translator, if any 

12- name of the text/work 

13- place of the publication 

14- identification of the part of a larger work/text 
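
To show how such a list of mark-up levels might be held and serialized, here is a minimal sketch. The element names, the `<header>` wrapper and the field names are hypothetical; the actual CNC DTD is not given in the text, so this only mirrors the levels listed above and is not the project's real header format.

```python
from dataclasses import dataclass, fields
from typing import Optional
from xml.sax.saxutils import escape


@dataclass
class TextHeader:
    """One record per text; field names are illustrative, not CNC's own."""
    corpus_type: str                       # synchronic, diachronic, spoken, parallel
    text_type: str                         # informative, imaginative or mixed
    genre: str
    subgenre: Optional[str] = None
    medium: Optional[str] = None
    verse: bool = False
    author_sex: Optional[str] = None
    language: Optional[str] = None         # for foreign-language texts
    original_language: Optional[str] = None  # for translations
    year: Optional[int] = None
    first_year: Optional[int] = None
    author: Optional[str] = None
    translator: Optional[str] = None
    title: Optional[str] = None
    place: Optional[str] = None
    part: Optional[str] = None


def render_header(header: TextHeader) -> str:
    """Serialize the known fields as simple SGML-style elements, skipping unknown ones."""
    lines = ["<header>"]
    for f in fields(header):
        value = getattr(header, f.name)
        if value is None:
            continue
        lines.append(f"  <{f.name}>{escape(str(value))}</{f.name}>")
    lines.append("</header>")
    return "\n".join(lines)


example = TextHeader(
    corpus_type="synchronic",
    text_type="imaginative",
    genre="novel",
    medium="book",
    year=1995,
    title="A & B",
)
rendered = render_header(example)
```

Leaving unknown values out, rather than writing empty elements, matches the "if known" wording of several of the levels above.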

Abstract. It is commonly agreed nowadays that information structure is a crucial 
aspect of the linguistic meaning of a sentence and its interpretation in discourse 
context. Here, we present an information-structure sensitive account of discourse 
interpretation, formalized as a hybrid modal logic, which can be smoothly 
related to a dependency grammar. This formalization creates a bridge between the 
Praguian approach to information structure and contemporary accounts of 
dynamic discourse interpretation. 



1 Introduction 

Information structure (IS) is an inherent aspect of the linguistic meaning of a sentence 
that a speaker employs as a means to present some parts of the linguistic meaning as 
context-dependent, and others as context-affecting. The Prague School has long argued 
for the need to consider IS, and to provide an integrated description of both the effects 
IS has on discourse interpretation and its realization through various structural means 
[12]. In this, the Prague School approach differs from those that consider either only 
a syntactic description of IS (following the Government & Binding tradition) or only its 
semantic description (after Karttunen, Rooth). Steedman’s theory of IS [13] is similar 
in spirit to the Prague School, although its formal description differs. 

Because of the context-dependent/affecting nature of IS, it is most natural to 
conceive of its interpretation in a dynamic discourse-oriented fashion, such as Discourse 
Representation Theory (DRT, [16]). Recently, there have been various proposals for 
formalizing this idea. Vallduví [14] proposes an instruction-based dynamic account using 
File-Change Semantics, which Dekker and Hendriks [2] reformulate in DRT. Steedman 
[13] proposes a semantics for IS using alternative sets, and also discusses their dynamic 
interpretation. Peregrin [11] offers an extensional dynamic interpretation of the Praguian 
notion of IS, which we [7,6] have extended to an intensional DRT-based account. 

These proposals suffer from various problems. Peregrin and Steedman do not specify 
any discourse model, while Vallduví relies on a discourse model that has been criticized 
by various authors (e.g., [2]). Furthermore, Peregrin’s, Vallduví’s, and Dekker and 
Hendriks’ accounts lack a relation between IS and its realization (grammar). Steedman’s 
grammar treatment concentrates on intonation as a means of realizing IS; the related 
treatment of word order proposed by Hoffman is problematic [5]. Our earlier accounts 
also do not have a fully worked out relation between IS and its realization (grammar), 
and the representation of IS in complex cases is flawed. 

The present account attempts to overcome these problems. We offer a new 
formalization of the Praguian notions of IS, relating a detailed account of IS realization 
[5] to a computable fragment of hybrid modal logic [1], in which we develop an IS-sensitive 
approach to dynamic discourse interpretation. By this we hope to make the Praguian 
notions relating to IS more explicit and precise. 

Characteristics of the approach. We assume that the linguistic meaning of a sentence is 
specified as a dependency tree, where dependency relations (“semantic roles”) connect 
heads and dependents, and informativity is specified for each node. In the present paper, 
we only distinguish two types of informativity: contextually dependent (bound, CB) vs. 
contextually affecting (nonbound, NB).¹ Abstractly, it is possible to represent a 
dependency tree with a head and dependents $\delta_1, \ldots, \delta_n$ of kinds 
$\rho_1, \ldots, \rho_n$, with informativity $\iota_{head}, \iota_1, \ldots, \iota_n$, as a 
formula in a modal logic: 
$[\iota_{head}]\,head \wedge [\iota_1]\langle\rho_1\rangle\delta_1 \wedge \ldots \wedge [\iota_n]\langle\rho_n\rangle\delta_n$. 

We employ hybrid (modal) logic [1] as the basis for our formalization. It allows us 
to distinguish different sorts of propositions besides multiple unary modal operators. 
Particularly, one can distinguish nominals. A nominal is a term that enables one to refer, 
in the object language, to a state in the underlying model. The jump-operator @ makes it 
possible to jump to a state named by a nominal $i$ and evaluate a formula $\phi$ there: 
$@_i\phi$. We adopt Kruijff’s formalization of the basic DRT notions in hybrid logic [5]: 
nominals are interpreted as discourse referents that are bound to propositions using the 
jump-operator. Consider example (1). (1a) shows its representation structure according to 
DRT. (1b) shows a hybrid logical formula derivable for (1) [5] ($v$ models the speaker’s 
vantage point).² 



(1) (What does Kathy study?) Kathy studies MATHEMATICS. 

    a. $[\, e_1, x, y \mid study(e_1, x, y),\ kathy(x),\ math(y) \,]$ 

    b. $@_v([CB](s \wedge study) \wedge [CB]\langle ACTOR\rangle(k \wedge Kathy) \wedge [NB]\langle PATIENT\rangle(m \wedge math))$ 
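
To make the shape of such formulas concrete, the following sketch renders a one-level dependency tree as a formula string of the kind shown in (1b). It is only an illustration of the notation, not part of DGL: the `Node` class, the function name and the restriction to a head with simple (non-nested) dependents are assumptions made here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    nominal: str                 # discourse referent, e.g. "k"
    lemma: str                   # proposition holding at that referent, e.g. "Kathy"
    informativity: str           # "CB" or "NB"
    dependents: List[Tuple[str, "Node"]] = field(default_factory=list)  # (role, node)


def to_formula(head: Node, vantage: str = "v") -> str:
    """Render a head and its direct dependents; nested dependents are not handled."""
    conjuncts = [f"[{head.informativity}]({head.nominal} ∧ {head.lemma})"]
    for role, dep in head.dependents:
        conjuncts.append(
            f"[{dep.informativity}]⟨{role}⟩({dep.nominal} ∧ {dep.lemma})"
        )
    return f"@{vantage}(" + " ∧ ".join(conjuncts) + ")"


# Answer to "What does Kathy study?": only the Patient is contextually affecting.
what_question = Node("s", "study", "CB", [
    ("ACTOR", Node("k", "Kathy", "CB")),
    ("PATIENT", Node("m", "math", "NB")),
])
formula_1b = to_formula(what_question)
print(formula_1b)
```

Changing only the informativity values of the two dependents yields a different formula for the same words, which is how the representation keeps apart answers to different questions.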



A discourse representation structure (DRS) consists of a list of discourse referents 
and a list of conditions, each of which can itself be a DRS. Co-reference between an 
anaphoric expression and its antecedent is captured by binding of discourse referents. 
A DRS can be viewed as a predicate logic formula where the discourse referents are 
variables bound by the existential operator. However, merging DRSs is more involved 
than simply conjoining formulas, because it allows binding a discourse referent 
introduced in a DRS $K_2$ to a discourse referent introduced in a DRS $K_1$ as long as 
$K_2$ is at the same level as or subordinate to $K_1$ (cf. the continuation of (1) with 
(2), and the merging of their DRSs in (3)). 

¹ In [5], four types of informativity are distinguished: CB and NB, CB* and NB*, where CB* 
corresponds to the c marker in [3] or Steedman’s Theme-focus [13]. NB* corresponds to the 
focus proper in [12] or Steedman’s Rheme-focus [13]. 

² Small capitals in example sentences indicate words carrying pitch accents. Each example 
sentence is presented as an answer to a question, which helps to indicate the intended IS. 

(2) It FASCINATES her. 

(3) $[\, e_1, x, y \mid study(e_1, x, y),\ kathy(x),\ math(y) \,] \otimes [\, e_2, u, v \mid fascinate(e_2, u, v),\ u = ?,\ v = ? \,]$ 

    $\rightarrow\ [\, e_1, e_2, x, y \mid study(e_1, x, y),\ kathy(x),\ math(y),\ fascinate(e_2, y, x) \,]$ 
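
The merge in (3) can be sketched as follows. This is a simplified illustration under stated assumptions: DRSs are flat (no subordinate DRSs), the anaphoric bindings are supplied by hand rather than resolved, and the class and function names are made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Condition = Tuple[str, Tuple[str, ...]]   # (predicate, arguments)


@dataclass(frozen=True)
class DRS:
    referents: Tuple[str, ...]
    conditions: Tuple[Condition, ...]


def merge(k1: DRS, k2: DRS, bindings: Dict[str, str]) -> DRS:
    """Merge k2 into k1, binding anaphoric referents of k2 to antecedents in k1."""
    for anaphor, antecedent in bindings.items():
        if anaphor not in k2.referents:
            raise ValueError(f"{anaphor} is not a referent of the second DRS")
        if antecedent not in k1.referents:
            raise ValueError(f"{antecedent} is not a referent of the first DRS")
    # Bound referents disappear; the remaining ones are added to the universe.
    referents = k1.referents + tuple(r for r in k2.referents if r not in bindings)
    renamed = tuple(
        (pred, tuple(bindings.get(arg, arg) for arg in args))
        for pred, args in k2.conditions
    )
    return DRS(referents, k1.conditions + renamed)


k1 = DRS(("e1", "x", "y"),
         (("study", ("e1", "x", "y")), ("kathy", ("x",)), ("math", ("y",))))
k2 = DRS(("e2", "u", "v"), (("fascinate", ("e2", "u", "v")),))

# "It" is bound to mathematics (y) and "her" to Kathy (x).
merged = merge(k1, k2, {"u": "y", "v": "x"})
```

Plain conjunction of the two DRSs would leave u and v unresolved; the substitution step is what makes merging "more involved than simply conjoining formulas".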



Dependency Grammar Logic (DGL, [5]) provides a framework in which 
representations like (1b) are obtained in a compositional, monotonic way from an analysis 
of the sentence’s surface form. Because IS is represented explicitly, the representation 
of the sentence in (1) is different from that for the sentence in (4) (unlike the DRS, 
which is the same). 

(4) (Who studies mathematics?) KATHY studies mathematics. 

    a. same as (1a) 

    b. $@_v([CB](s \wedge study) \wedge [NB]\langle ACTOR\rangle(k \wedge Kathy) \wedge [CB]\langle PATIENT\rangle(m \wedge math))$ 

In this paper we do not deal with the realization of IS (cf. [5]). Rather, we focus on 
how to interpret linguistic meaning, represented as a hybrid logical formula like (lb) or 
(4b). In §2 we show how we derive the IS-partitioning and we describe the IS-structures 
that our approach allows to be derived. In §3 we give a formalization of the dynamic 
interpretation of IS in our approach. 



2 Topic/Focus Articulation 

DGL follows the approach to IS advocated by the Prague School [12]. IS is hereby 
referred to as the Topic/Focus Articulation (TFA) of a sentence, and is derived recursively 
from the indication of informativity of the individual nodes in its linguistic meaning. 
A moderate form of recursivity of both the primitive notions (CB/NB) and TFA is 
employed. 

To establish TFA, we use rules that rewrite a logical formula just indicating 
informativity to a logical formula with the TFA-bipartitioning, represented as 
$T \bowtie F$ [6]. The idea of using rewriting stems from Oehrle [9], but the rules 
implement the TFA recursion defined in [12], with the amendments of [3] (p. 164). 

In our hybrid logic formalism, we represent TFA as $@_h(T \bowtie F)$. $T$ and $F$ are 
conjunctions of terms, whereby $T$ may be empty (written as $\top$). By definition, $F$ must 
not be empty;³ if at any point an empty focus (written as $\bot$) is obtained, a deeper 
embedded NB element is sought to serve as focus. A rewrite rule is stated as 
$\mathcal{R}(\phi \mapsto \psi)$, rewriting $\phi$ into $\psi$. Containment $C[\cdot]$ is 
not recursive, but pertains only to the current level of conjunction. 

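¹ Steedman [13] discusses examples with IS-partitioning corresponding to “all-topic” which are
not allowed for in [12] (and the current formalization in DGL).

(5) If a verbal head of the clause is CB, then it belongs to the topic.

$\mathcal{R}(@_h([\mathrm{CB}](\xi \wedge \chi) \wedge \Phi) \mapsto @_h([\mathrm{CB}](\xi \wedge \chi) \bowtie \Phi))$

(6) If a verbal head of the clause is NB, then it belongs to the focus.

$\mathcal{R}(@_h([\mathrm{NB}](\xi \wedge \chi) \wedge \Phi) \mapsto @_h(\top \bowtie [\mathrm{NB}](\xi \wedge \chi) \wedge \Phi))$

(7) If a dependent $\delta$ of a verbal head is CB, then $\delta$ belongs to the topic
(including any nodes it governs).

$\mathcal{R}(@_h(\Phi \bowtie [\mathrm{CB}]\delta \wedge \Psi) \mapsto @_h([\mathrm{CB}]\delta \wedge \Phi \bowtie \Psi))$

(8) If a dependent $\delta$ of a verbal head is NB, then $\delta$ belongs to the focus
(including any nodes it governs).

$\mathcal{R}(@_h(\Phi \bowtie [\mathrm{NB}]\delta \wedge \Psi) \mapsto @_h(\Phi \bowtie \Psi \wedge [\mathrm{NB}]\delta))$

(9) If a CB dependent of type $\delta$ is an embedded clause, then it should be
placed first (topic proper).

$\mathcal{R}(@_h(\Phi[[\mathrm{CB}]\langle\delta\rangle(\tau[(\xi \wedge \chi)])] \bowtie \Psi) \mapsto @_h([\mathrm{CB}]\langle\delta\rangle(\tau[(\xi \wedge \chi)]) \wedge \Phi \bowtie \Psi))$

(10) If a NB dependent of type $\delta$ is an embedded clause, then it should be
placed last (focus proper).

$\mathcal{R}(@_h(\Phi \bowtie \Psi[[\mathrm{NB}]\langle\delta\rangle(\tau[(\xi \wedge \chi)])]) \mapsto @_h(\Phi \bowtie \Psi \wedge [\mathrm{NB}]\langle\delta\rangle(\tau[(\xi \wedge \chi)])))$

(11) Embedded focus: If in $\Phi \bowtie \Psi$, $\Psi$ contains no inner participants ($\delta \in$
{ACTOR, PATIENT, ADDRESSEE, EFFECT, ORIGIN}) whereas $\Phi$ does,
then a NB modification of a CB dependent is part of the focus:

$\mathcal{R}(@_h(\Phi[[\mathrm{CB}]\langle\delta\rangle(\tau[[\mathrm{NB}]\langle\delta'\rangle(d \wedge \phi)])] \bowtie \Psi)$

$\mapsto @_h(\Phi[[\mathrm{CB}]\langle\delta\rangle(\tau[\,])] \bowtie [\mathrm{CB}]\langle\delta\rangle[\mathrm{NB}]\langle\delta'\rangle(d \wedge \phi) \wedge \Psi))$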



A valid TFA is a structure $T \bowtie F$ to which we can no longer apply any of the rewrite
rules given in (5) through (11), and where $F \neq \bot$.
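
The effect of rules (5) through (8) and (11) on a single clause can be illustrated with a small sketch. It is not the rewriting system itself: it assumes a hypothetical `Node` class for a head with dependents, ignores embedded clauses (rules (9) and (10)), and moves every NB modifier of a CB dependent into the focus without checking the inner-participant condition of rule (11).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical node of linguistic meaning (not part of DGL itself)."""
    label: str                # lexical content, e.g. "eat"
    info: str                 # informativity: "CB" or "NB"
    relation: str = "HEAD"    # dependency relation to the head, e.g. "ACTOR"
    deps: List["Node"] = field(default_factory=list)


def partition(head: Node) -> Tuple[List[str], List[str]]:
    """Split a clause into (topic, focus) terms."""
    topic: List[str] = []
    focus: List[str] = []
    # Rules (5)/(6): a CB head belongs to the topic, an NB head to the focus.
    (topic if head.info == "CB" else focus).append(head.label)
    for dep in head.deps:
        # Rules (7)/(8): CB dependents go to the topic, NB ones to the focus.
        (topic if dep.info == "CB" else focus).append(f"{dep.relation}:{dep.label}")
        # Rule (11), simplified: an NB modifier of a CB dependent is focal.
        if dep.info == "CB":
            for mod in dep.deps:
                if mod.info == "NB":
                    focus.append(f"{dep.relation}/{mod.relation}:{mod.label}")
    return topic, focus


# "I met the teacher of CHEMISTRY."
clause = Node("meet", "CB", deps=[
    Node("I", "CB", "ACTOR"),
    Node("teacher", "CB", "PATIENT", [Node("chem", "NB", "APPURTENANCE")]),
])
topic, focus = partition(clause)
print(topic, "|", focus)
```

The focus term keeps the path through its CB governor (`PATIENT/APPURTENANCE`), which loosely mirrors the way the rewritten formula retains the link between an embedded focus and its head.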

TFA Structures. Example (12) illustrates the application of the above rules. We obtain
a relational TFA structure, where the relata may be distributed across the $\bowtie$ operator,
while maintaining their mutual relations through nominal reference.

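(12) (What did the cat eat?) The cat ate a SAUSAGE.

$@_h([\mathrm{NB}](e \wedge \mathit{eat}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(c \wedge \mathit{cat}) \wedge [\mathrm{NB}]\langle\mathrm{PATIENT}\rangle(s \wedge \mathit{saus}))$

(6): $@_h(\top \bowtie [\mathrm{NB}](e \wedge \mathit{eat}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(c \wedge \mathit{cat}) \wedge [\mathrm{NB}]\langle\mathrm{PATIENT}\rangle(s \wedge \mathit{saus}))$

(7): $@_h([\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(c \wedge \mathit{cat}) \bowtie [\mathrm{NB}](e \wedge \mathit{eat}) \wedge [\mathrm{NB}]\langle\mathrm{PATIENT}\rangle(s \wedge \mathit{saus}))$

We also obtain the desired TFA for examples like (13):

(13) (Which teacher did you meet?) I met the teacher of CHEMISTRY.

$@_h([\mathrm{CB}](m \wedge \mathit{meet}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(i \wedge \mathit{I})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(t \wedge \mathit{teacher} \wedge [\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(c \wedge \mathit{chem})))$

(5), (7), (7), (11): $@_h([\mathrm{CB}](m \wedge \mathit{meet}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(i \wedge \mathit{I})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(t \wedge \mathit{teacher}) \bowtie [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle[\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(c \wedge \mathit{chem}))$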

The rewriting in (13) relies crucially on rule (11) handling embedded foci. The 
formulation of rule (11) differs from Sgall et al.’s [12], in that it is a generalization 
similar to that proposed by Koktova [4]. (11) enables us to deal properly with examples 
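like (14), which answers a so-called double-focus question.

(14) (Whom did you give what book?) I gave the book on SYNTAX to KATHY.

$@_h([\mathrm{CB}](g \wedge \mathit{give}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(i \wedge \mathit{I})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(b \wedge \mathit{book} \wedge [\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(s \wedge \mathit{syn}))$

$\wedge\ [\mathrm{NB}]\langle\mathrm{ADDRESSEE}\rangle(k \wedge \mathit{Kathy}))$

(5), (7), (7), (11): $@_h([\mathrm{CB}](g \wedge \mathit{give}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(i \wedge \mathit{I})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(b \wedge \mathit{book}) \bowtie [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle[\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(s \wedge \mathit{syn})$

$\wedge\ [\mathrm{NB}]\langle\mathrm{ADDRESSEE}\rangle(k \wedge \mathit{Kathy}))$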

Examples (13) and (14) can be combined to form (15), which can be analyzed
straightforwardly in the present framework. In contrast, (15) seems impossible to analyze
in Vallduví’s account [14], and it is not clear to us how it would be handled in
Steedman’s account [13].

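(15) (Which teacher did you give what book?)

I gave the book on SYNTAX to the teacher of English.

$@_h([\mathrm{CB}](g \wedge \mathit{give}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(i \wedge \mathit{I})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(b \wedge \mathit{book} \wedge [\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(s \wedge \mathit{syn}))$

$\wedge\ [\mathrm{CB}]\langle\mathrm{ADDRESSEE}\rangle(t \wedge \mathit{teacher} \wedge [\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(e \wedge \mathit{English})))$

(5), (7), (7), (7), (11), (11): $@_h([\mathrm{CB}](g \wedge \mathit{give}) \wedge [\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(i \wedge \mathit{I})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(b \wedge \mathit{book}) \wedge [\mathrm{CB}]\langle\mathrm{ADDRESSEE}\rangle(t \wedge \mathit{teacher})$

$\bowtie\ [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle[\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(s \wedge \mathit{syn})$

$\wedge\ [\mathrm{CB}]\langle\mathrm{ADDRESSEE}\rangle[\mathrm{NB}]\langle\mathrm{APPURTENANCE}\rangle(e \wedge \mathit{English}))$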



Formally, nominals ensure that dependents and heads remain properly linked. This
avoids the problems of a typed approach like Kruijff-Korbayová’s [6] when complex
TFA structures like (14) are concerned.



3 Dynamic TFA-Sensitive Discourse Interpretation 

Following dynamic approaches to interpretation of IS, e.g., [6,11,13], the $\bowtie$-operator
which connects the topic and the focus controls how the sentence is interpreted with
respect to a discourse model. Essentially, a hybrid logical formula of the form $T \bowtie F$ is
interpreted by first evaluating $T$ against the current (discourse) model, and only if $T$ can
be interpreted, evaluating $F$.

We define a discourse structure $D$ as a structure $\langle \mathcal{D}, \{R_{\mathcal{D}}\}, HB, P \rangle$. $HB$ is a hybrid
logical back-and-forth structure with spatial extension [5]. $P$ is a sorted structure on
which we interpret objects and properties, which may overlap with $HB$’s sorted spatial
structure. $\mathcal{D}$ is a set of nominals understood as points in a discourse, and $\{R_{\mathcal{D}}\}$ is a set
of relations modelling models in $D$. $\{R_{\mathcal{D}}\}$ includes at least $\prec$, the relation that defines
a total order over the nominals in $\mathcal{D}$.

A discourse model $\mathfrak{M}_D$ is a tuple $(D, V)$ with $D$ a discourse structure and $V$ a hybrid
valuation. A linguistic meaning represented as $@_h(T \bowtie F)$ is interpreted by evaluating
$\mathfrak{M}_D, w \models @_h(T \bowtie F)$, where $w$ is the denotation of $h$. This evaluation proceeds as
follows: from the point reached by the preceding discourse, various “discourse continuations”
are possible (i.e., various states $w$ are accessible under $R$). We first evaluate the
topic, i.e., $\mathfrak{M}_D, w \models @_h(T)$, which intuitively means selecting the state(s) $w$ on which
the topic can be positively evaluated. Then, we continue by evaluating the focus on $w$,
given that $T$ holds, i.e., $\mathfrak{M}_D, w \models @_h(T \wedge F)$. If the topic-evaluation phase fails,
the linguistic meaning is infelicitous in the given context. If the topic-evaluation phase
succeeds, but the focus-evaluation phase fails, the linguistic meaning is false in the given
context.
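
The two-phase evaluation just described can be sketched in a few lines. This is only an illustration under strong simplifying assumptions: states are modeled as plain sets of atomic facts, and the topic and focus are arbitrary Boolean tests on a state; the names `State`, `Formula` and `interpret` are ours, not part of the formalism.

```python
from typing import Callable, FrozenSet, Iterable

State = FrozenSet[str]               # a state, simplified to a set of atomic facts
Formula = Callable[[State], bool]    # a formula, simplified to a test on a state


def interpret(states: Iterable[State], topic: Formula, focus: Formula) -> str:
    """Evaluate the topic first; evaluate the focus only where the topic holds."""
    # Phase 1: select the accessible states on which the topic can be evaluated.
    candidates = [w for w in states if topic(w)]
    if not candidates:
        return "infelicitous"        # topic evaluation failed
    # Phase 2: evaluate the focus on the states selected by the topic.
    if any(focus(w) for w in candidates):
        return "true"
    return "false"                   # topic succeeded, focus failed


# "(Who studies mathematics?) Kathy studies mathematics."
accessible = [frozenset({"study", "math", "kathy"})]
topic = lambda w: {"study", "math"} <= w
focus = lambda w: "kathy" in w
print(interpret(accessible, topic, focus))
```

Separating the two phases is what distinguishes an infelicitous utterance (no state supports the topic) from a false one (the topic is supported, the focus is not).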

This gives us a basic dynamic model of interpretation, modeling $\bowtie$ similarly to
Kruijff-Korbayová [6] (p. 79ff) or Muskens et al.’s dynamic conjunction [8]. The important
step now is to define a notion of discourse accessibility. Without that we could not,
for example, resolve the pronominal anaphors in (2).

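Discourse referent accessibility. For the discourse model to support a formula $\phi$, i.e.,
$\mathfrak{M}_D, w \models @_h(\phi)$, we need to specify the meaning of contextual boundness. We first of
all have the following standard definitions:

$\mathfrak{M}, w \models \phi \wedge \psi$ iff $\mathfrak{M}, w \models \phi$ and $\mathfrak{M}, w \models \psi$

$\mathfrak{M}, w \models \langle\pi\rangle\phi$ iff $\exists w' (w R_\pi w' \ \&\ \mathfrak{M}, w' \models \phi)$

$\mathfrak{M}, w \models [\pi]\phi$ iff $\forall w' (w R_\pi w' \rightarrow \mathfrak{M}, w' \models \phi)$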



For the modal relation CB we define the accessibility relation $R_{CB}$ as follows:
$w R_{CB} w'$ & $\mathfrak{M}, w' \models \phi$ means that there is a state $p$ in $\mathcal{D}$, $p \prec w$, such that at $p$ either
$\mathfrak{M}, p \models [\mathrm{CB}]\phi$ or $\mathfrak{M}, p \models [\mathrm{NB}]\phi$. The accessibility relation connects $\mathcal{D}$ with $HB$, $P$,
over which it models universal accessibility. (Cf. [5] for details.)

Next, for a dependency relation $\delta$ the accessibility relation $R_\delta$ is defined from $HB \cup P$
to $HB \cup P$, interpreting $\langle\delta\rangle\phi$ on a state $s$ that is of the right sort given $\phi$.

If we consider the generic discourse (modal) relation $\mathcal{R}$ to be modeled with $\prec$ as its
underlying accessibility relation, we can specify discourse accessibility:

(16) 0,(A'5)a A A A 0,Ii'](<5')a, 

for any 6, 6' € -A, 4,i' € tii} . 

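Using accessibility, the representation of (2) is as follows:

(17) $@_{h'}([\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(p \wedge @_p\langle\mathrm{XS}\rangle(a \wedge \mathit{object})) \wedge [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(p' \wedge @_{p'}\langle\mathrm{XS}\rangle(a' \wedge \mathit{female})) \bowtie [\mathrm{NB}](f \wedge \mathit{fascinate}))$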



Salience. The salience of an item x at a current point in the discourse, h, is defined as
follows. If x is NB under h, then the salience of x is 0: Salience(h[NB]x) = 0. If x is CB
under h, then the salience of x is defined as follows. Let <* be the non-reflexive transitive 
closure of <i. Let x) be the length of the path from h to h', with h' <i* h and x NB 
under h' . Let R^{h, x) be the length of the path from h to h”, with x last CB-refered 
to under h”. If there is no h”, then R^{h, x)=0. The salience of a CB item x under h is 
calculated as in (18).

Salience(h[CB]x) =

1

if x is realised as a pronoun
if x is realised as a definite noun



This model of calculating the salience of discourse referents is a simplified version
of that in [12]. The point here is to show that one can incorporate a salience metric into 
the present framework and use it during interpretation as a partial ordering over possible 
antecedents, at every point in the discourse. If for a CB item under h there are several 
possible antecedents to which it could be referring, then by definition the most salient 
possible antecedent is preferred. 

Discourse Merging. We are given a specification of linguistic meaning for an individual
sentence, $@_h(T \bowtie F)$, and a representation of the discourse so far, $\mathfrak{D}$, both formulated as
hybrid logical formulas. By definition we have that $\mathfrak{M}_D \models \mathfrak{D}$. The empty discourse is
modeled as $\top$. The merge-operator $\otimes$ for merging $\mathfrak{D}$ and $@_h(T \bowtie F)$, $\mathfrak{D} \otimes @_h(T \bowtie F)$,
is defined as follows.

i. If $\mathfrak{D} = \top$, then take a nominal $d \in \mathcal{D}$, designate $d$ as current($d$), and
interpret $@_h(T \bowtie F)$ (equating $d$ and $h$, $@_d h$). Let $\mathfrak{D} = @_d(T \bowtie F)$.

ii. If $\mathfrak{D} \neq \top$, then take a nominal $d \in \mathcal{D}$ for which it holds that $d' \prec d$,
current($d'$). Evaluate $@_d(T \bowtie F)$. Let $\mathfrak{D} = \mathfrak{D} \wedge @_{d'}\langle\mathcal{R}\rangle d \wedge @_d(T \bowtie F)$,
and set current($d$).

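We illustrate the result of discourse merging for the discourse (1+2) below:

(19) $@_d([\mathrm{CB}](s \wedge \mathit{study}) \wedge [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(m \wedge \mathit{math}) \bowtie [\mathrm{NB}]\langle\mathrm{ACTOR}\rangle(k \wedge \mathit{Kathy}))$

$\wedge\ @_d\langle\mathcal{R}\rangle d' \wedge @_{d'}([\mathrm{CB}]\langle\mathrm{ACTOR}\rangle(m \wedge \mathit{math}) \wedge [\mathrm{CB}]\langle\mathrm{PATIENT}\rangle(k \wedge \mathit{Kathy}) \bowtie [\mathrm{NB}](f \wedge \mathit{fascinate}))$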



4 Conclusions 

We gave an intensional logical formalization of IS and its dynamic interpretation, and 
showed how the Praguian views on representing and interpreting IS can be employed 
in a fully formalized setting oriented towards dynamic discourse processing, with the 
perspective of providing an efficient implementation [1]. 

The formalization incorporates various aspects that are novel, or that overcome short- 
comings of previous approaches. For example, the approach explicitly incorporates a 
basic notion of salience in interpretation, and it is capable of dealing with the represen- 
tation of complex IS phenomena like double foci and embedded foci (unlike e.g. [14]). 
Furthermore, the approach can be smoothly related to a dependency-based grammar 
formalism (DGL) for the realization of IS [5], without having to make simplifying
assumptions as in [16]. When taken altogether, this enables us to account for the realization,
representation, and interpretation of IS in a compositional and monotonic fashion.

Abstract. The performance of taggers is usually evaluated by their percentage
success rate. Because such an approach is purely quantitative, all errors
committed by the tagger are treated on a par for the purpose of the evaluation.
This paper takes a different, qualitative stand on the topic, arguing that the
previous viewpoint is not linguistically adequate: the errors (might) differ in severity.
General implications for tagging are discussed, and a simple method is proposed
and exemplified, able to

1. detect and in some cases even rectify the most severe errors and thus

2. contribute to arriving finally at a better tagged corpus. 

Some encouraging results achieved by a very simple, manually performed test and 
evaluation on a small sample of a corpus are given. 



1 Introduction 

The crucial task of tagging is the resolution of morphological ambiguity of words in a
text (text corpus) - either on the level of part of speech of a word (e.g., the English word
lead has to be disambiguated between the nominal and verbal reading) or on some finer
level (e.g., the German noun Leiter is ambiguous between feminine and masculine, the
Latin form feminae is either singular genitive or dative or nominative plural, etc.). The
most widespread taggers are based on statistical approaches (e.g., bigram or trigram
taggers, maximum-entropy taggers), that is, these taggers “learn” tagging from manually
tagged texts; some other taggers try to learn the rules from a text or use hand-written
rules (Brill-style taggers, Constraint-grammar-style taggers).

However, resolution of morphological ambiguity is far from an easy task, and in its
entirety, it would indeed require full understanding. Fortunately, for a lot of practical
purposes ambiguity resolution which performs with an “acceptable” (whatever this means
in the particular case) error rate is sufficient. On the other hand, tagging systems which
make “fewer” and/or “lighter” errors are clearly to be preferred. This calls for a
linguistic analysis of error sources - the detailed outcome of which might be different for
different languages, but which (at least in the case of most current statistical taggers)
has a common denominator: the locality of processing. In other words, most errors come 
into being due to the fact that very often the tagger is able to take into account only a 
limited number of adjacent words, and it is unable to process any kind of information 
“hidden” in the sentence - be it even the most trivial one - which is not recoverable from 
the local context. 

This suggests that the taggers based on strictly local processing will
have to be replaced by other systems (probably either purely rule-based, or combining
statistical and rule-based approaches) in our better future. But, for the time being, we
have to get by with what we have, and given the general estimate that writing a
reasonable Constraint-grammar-style tagger takes about ten years, we shall have to do so
for some time yet. Nevertheless, we should attempt to make our current real world as good
as possible. In the world of statistically tagged corpora, this means that we should either
make our taggers better (which, however, is a difficult task after a certain threshold has
been reached) or we should try to improve (rectify) the results which we get from the
current taggers, or at least point out the errors which are contained in their results, in
order to enable manual correction.



2 A Linguist’s View of the Quality of Performance of a Tagger 

The success rate of (statistical) taggers (and tagging methods) seems to be, as a rule, 
assessed only quantitatively, that is, by the percentage of successful hits on a testing 
corpus (hits of the “truth”). Apart from the fact that, taken strictly, the percentage of 
“hits of the truth” does not in fact tell us much even about the quantitative performance 
of a tagger (see below), the problem which thus remains undiscussed is the qualitative 
assessment of the results. In other words, it seems that little work has been done
so far in the area of linguistic evaluation of the quality of (the results of) a statistical tagger.
This paper points in this direction, using the particular case of the tagging results of
the TnT system (Brants, 2000) as applied to a corpus of German texts, and also draws
consequences from the facts established.

2.1 Preliminaries: The “Success Rate” and What It Really Tells Us 

The first problem we encounter is the interpretation of the figures representing the success
rate of a tagger. Taken strictly, the percentage figures as usually presented do not contain
any clear information about the linguistic performance of the tagger. These figures are
defined as the ratio of the number of wordforms with correct tags assigned by the system
within a test corpus to the number of all wordforms in the test corpus. For example, the
information that a tagger has a success rate of 98% is to be understood in the way that
for a given piece of test text, the tags assigned manually and the tags assigned to the
text by the tagger are identical on 98% of the words. This shows why such figures are
misleading: even in the (unrealistic) case that the results of the hand-tagging are error-free,
the coincidence of tags does not provide any information about the performance
of the tagger alone, but only in combination with the ambiguity of the text. That is, even
a tagger which systematically assigns incorrect tags whenever it has to choose between
two (or more) candidate tags for a word (not that such a tagger exists!) would achieve a
100% success rate on a text which is unambiguous¹. But these two pieces of information
(about the performance of the tagger and about the ambiguity of its input) cannot be
untangled - and hence the current practice of measuring the success rate of taggers has
very little to say about the real reliability of the particular tagger in question. Because
of this, it would be more adequate to measure the performance of the tagger as the
ratio of the number of ambiguous wordforms with correct tags assigned by the system
within a test corpus to the number of all ambiguous wordforms in the test corpus. This
would provide the relevant information about the performance of the tagger on a given
text in a clean form - however, at least to my knowledge it is never made publicly
available². The situation with the reliability of the results of a tagger would, however,
remain unsatisfactory even if this information were accessible. The point here is that
such a ratio gives information about the average reliability of a tagging result, but cannot
provide any information at all about the reliability of the result for a particular wordform
(at a particular position in the corpus), which is often the relevant information for a
linguist. That the statistical taggers are (at least currently) unable to meet these and other
linguistic requirements is the reason why they commit a lot of “grave” errors, and in
addition it gives them a questionable status from the viewpoint of the methodology of science:
they seem to be devices which try to “guess” the result, causing uncertainty, even if the
correct answer can be arrived at rigorously and above all unequivocally. Apart from this,
this kind of uncertainty gives rise to a quite practical problem: if the correctness of the result
of no single ambiguity resolution can be guaranteed, and the corpus should be
hand-checked for correctness, then it is necessary that the tags of all ambiguous wordforms be
checked. This obviously contrasts with the situation which would occur if the tagger
were able to distinguish between (absolutely) certain and uncertain results.
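
The difference between the usual success rate and the measure restricted to ambiguous wordforms can be made concrete with a short sketch. The token representation (gold tag, assigned tag, number of candidate tags) and the function names are assumptions made for this illustration only; following footnote 1, the restricted measure is left undefined (`None`) for a text without any ambiguity.

```python
from typing import List, Optional, Tuple

# (gold_tag, assigned_tag, number_of_candidate_tags) - a hypothetical layout
Token = Tuple[str, str, int]


def overall_success_rate(tokens: List[Token]) -> float:
    """Correctly tagged wordforms divided by all wordforms."""
    correct = sum(1 for gold, assigned, _ in tokens if gold == assigned)
    return correct / len(tokens)


def ambiguous_success_rate(tokens: List[Token]) -> Optional[float]:
    """Correctly tagged ambiguous wordforms divided by all ambiguous wordforms."""
    ambiguous = [(g, a) for g, a, n in tokens if n > 1]
    if not ambiguous:
        return None  # nothing to disambiguate: performance is undefined
    return sum(1 for g, a in ambiguous if g == a) / len(ambiguous)


sample = [
    ("N", "N", 1),      # unambiguous, trivially correct
    ("V", "N", 2),      # ambiguous, wrong
    ("N", "N", 1),      # unambiguous, trivially correct
    ("Adv", "Adv", 2),  # ambiguous, correct
]
print(overall_success_rate(sample))    # 0.75
print(ambiguous_success_rate(sample))  # 0.5
```

On this toy sample the tagger resolves only half of the real decisions correctly, although the conventional figure reports 75%.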

2.2 Long Distance Errors 

Let us now focus on errors which statistical taggers commit, and on the possibilities of 
correcting them. First, we shall take into consideration tagging errors which come into 
being due to interdependence between words standing “too far” from each other to be 
tagged correctly. For this purpose, let us consider the following examples taken from the 
results of the trigram-based TnT system; however, the results of a similar test performed 
for a corpus of Czech tagged by the maximum-entropy method show that the underlying 
nature of the statistical tagger plays no role. (The tags given in the parentheses are the 
tags wrongly assigned by the tagger.) 



Wrongly tagged “noch”. Der Staatssekretär im Wirtschaftsministerium, Alfons Zeller,
goß durch unbedachte Äußerungen noch(Conj) Öl ins Feuer.

¹ Since the tagger did nothing (there was no ambiguity to be resolved), it should in fact be natural
to require that the performance of any tagger on such a text remain simply undefined.

² However, even such a measure still does not eliminate the possible influence of the complexity
of the text on the result: it is intuitively obvious that it is more difficult to perform tagging in
a text where each word is (doubly) ambiguous than, e.g., to tag a text where at most one word
per sentence is (even ten times) ambiguous - but this distinction is not taken into consideration.

Das Bahnreformkonzept der Bundesregierung führt in ein ökologisches, verkehrspolitisches
und gesellschaftspolitisches Desaster, weil es weder in das ökologische Gesamtkonzept
eingebunden ist, das den ökologischen Zielsetzungen (der Bundesregierung
und der Bundestags-Enquete-Kommission zum Klimaschutz) hinsichtlich des Beitrags
der Schiene zum Personennahverkehr, Personenfernverkehr sowie zum Güterverkehr
entspricht, noch(Adv) die wesentlichen Voraussetzungen schafft, um die Existenz,
Minimalqualitäten und die notwendigen Verbesserungen des Nah- und Regionalverkehrs auf
der Schiene zu sichern.

The point is that the word noch is ambiguous between an intensifying adverb and 
the second member of the conjunction pair weder - noch. The linguistically trivial 
observation (however, of non-local nature) is that noch must be an adverb if it is not 
preceded by weder in the sentence, and that it must be a conjunction if it is the only 
occurrence of noch in a sentence where it is preceded by weder. 
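
The observation about noch translates directly into a sentence-level check. The following is a minimal sketch, assuming a sentence is given as a list of (word, tag) pairs and using the tag labels Adv and Conj as they appear in the examples; the function name is ours.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (wordform, tag) pairs


def noch_errors(sentence: Sentence) -> List[Tuple[int, str]]:
    """Return (position, required_tag) for every wrongly tagged 'noch'."""
    words = [w.lower() for w, _ in sentence]
    noch_positions = [i for i, w in enumerate(words) if w == "noch"]
    errors = []
    for i in noch_positions:
        preceded_by_weder = "weder" in words[:i]
        tag = sentence[i][1]
        if not preceded_by_weder and tag != "Adv":
            # No 'weder' to the left: 'noch' must be an adverb.
            errors.append((i, "Adv"))
        elif preceded_by_weder and len(noch_positions) == 1 and tag != "Conj":
            # Only 'noch' in a sentence with a preceding 'weder': conjunction.
            errors.append((i, "Conj"))
    return errors


tagged = [("weder", "Conj"), ("Sold", "N"), ("noch", "Adv"), ("Gehalt", "N")]
print(noch_errors(tagged))  # [(2, 'Conj')]
```

Sentences with several occurrences of noch after weder are deliberately left alone, since the observation above does not decide them.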

“Second position” too far from the beginning of the sentence. Einer Studie zufolge
führen(Vinf)f das in Europa verbotene Pestizid DDT in extrem hoher Konzentration.

The (contingent) reason for the error here is that the verbform führen is “too far”
from the sentence beginning and hence its status cannot be decided correctly using only
local context. However, since globally it is preceded neither by any verbal material, nor by the
particle zu, nor by anything which might introduce a subordinated clause, and it
is directly followed by nominal material, it must be a finite form of a verb.

Complementizer too far left from the (finite) verb. Eine norwegische Hotelkette
will ihren Gästen den Übernachtungspreis zurückerstatten, wenn diese den Aufenthalt
nachweislich zur “Kinderproduktion” genutzt haben(Vinf).

The observation here is that the tagging result breaks the rule that, in any German 
sentence, the complementizer has to be followed by a finite verbform - if there is only 
one candidate (as it is in this example with the verb haben), then this candidate has to 
be assigned this role. 

Relative pronouns. “Überall wimmelte es von Schweinen und deren(RelPron) Dreck,
bis 1481 ein Erlaß die Schweinehaltung in ganz Frankfurt verbot und sogenannte
‘Hundeschläger’ für die Dezimierung der Vierbeiner sorgten”, erläuterte Norbert Müller
während seines Vortrags “Wie lebten die alten Schwanheimer?” in der Stadtbücherei.

Die dänischen Stromunternehmen akzeptieren inzwischen ihre Doppelrolle, auf
der(RelPron) einen Seite Strom zu erzeugen und auf der anderen Seite auf seine möglichst
effiziente Verwendung hinzuwirken.

The configurations here are in both cases locally identical with the ones where
a relative pronoun usually occurs, but globally, again, trivial constraints block such
readings. Thus, in the first case, the word deren is not preceded by a comma (as it must
be in a German sentence, even though the comma need not immediately precede it),
and in the second example, the would-be relative pronoun der would have to trigger
a subordinated (relative) clause headed by a finite verb, a condition which is not met.

2.3 Smoothing and Local Errors

However, not only “long-distance” errors are to be found in a statistically tagged corpus.
A considerable number of “local errors” occurs as well. As it seems, the source of many such
errors might be the so-called smoothing, a method intended to be used for the tagging 
of local configurations (e.g., trigrams) which have not been seen in the learning corpus 
(and hence could not have been learnt from it). This seems necessary since without 
such a device, the tagger would not be able to assign (correct) tags in many of the 
configurations it encounters in “real” texts. This intended effect, however, is very often 
superseded by its opposite - namely, that impossible local configurations are dragged 
into the set of expected configurations, with the obvious consequence of using such 
erroneous configurations in the tagging process (and hence assigning incorrect tags). 
The damaging effects of smoothing can be clearly observed on local errors (wrongly 
tagged local configurations which fit into the “window” of the tagger, e.g., into one 
trigram). As examples of these errors we put forward the following cases: 

Preposition at the end of sentence. Wenige kamen allerdings nur außerhalb
Heusenstamms unter(Prep).

Von 18 bis 20 Uhr steht der zweite DRK-Blutspende-Termin dieses Jahres in Eppstein
unter dem Motto “Komm, mach’ mit(Prep)”.

The local configurations

<preposition> <fullstop>

<preposition> <inverted comma> <fullstop>

could not have occurred in the learning corpus, since such configurations do not exist
in standard German - with the exception of the adverbial use of ohne in the collocation
as used in Sie war oben ohne. (Note that nach in meiner Meinung nach is handled as
a postposition in the tagset used.)
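
A detector for these two impossible configurations needs nothing more than a look at the end of the sentence. The sketch below assumes (word, tag) pairs and the tag label Prep from the examples; the sets of closing punctuation and the function name are illustrative choices, and the ohne exception mentioned above is the only one built in.

```python
from typing import List, Optional, Tuple

SENTENCE_END = {".", "!", "?"}
QUOTES = {'"', "“", "”", "„", "'", "’"}
EXCEPTIONS = {"ohne"}  # adverbial use, as in "Sie war oben ohne."


def final_preposition_error(sentence: List[Tuple[str, str]],
                            prep_tag: str = "Prep") -> Optional[int]:
    """Return the position of a preposition tag directly before the sentence end."""
    i = len(sentence) - 1
    # Skip the closing full stop and any inverted commas.
    while i >= 0 and sentence[i][0] in SENTENCE_END | QUOTES:
        i -= 1
    if i < 0 or i == len(sentence) - 1:
        return None  # empty sentence, or no closing punctuation found
    word, tag = sentence[i]
    if tag == prep_tag and word.lower() not in EXCEPTIONS:
        return i
    return None


tagged = [("Wenige", "Pron"), ("kamen", "Vfin"), ("unter", "Prep"), (".", "Punct")]
print(final_preposition_error(tagged))  # 2
```

The check only detects the error; the correct tag (for instance a separable verb particle) still has to be chosen by other means.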

“Weder - noch” within a trigram. Aber weder sie noch(Adv) die virtuosen Einsätze
des Kronos-Quartetts (streichen, zupfen, percussive Arrangements der Streicher) treten
in den Vordergrund, sie verschmelzen.

Noch beziehen sie, wie zu Kriegszeiten, weder Sold noch(Adv) Gehalt, erhalten
aber inzwischen ein “Taschengeld” von 50 Birr (35 Mark) im Monat.

It is impossible that local configurations like the ones from the examples occurred
in the learning data. Quite the contrary: whenever weder was closely followed by
noch, this noch had to be a conjunction, and it could never be an adverb.

Relative Pronoun. Als der(RelPron) sich später auflöste, übernahm das Ordnungsamt
die Organisation.

A relative pronoun has to be preceded by at least one noun and one comma in any
German sentence; hence, it can never occupy the position of the second item.

3 Linguistic Strategies for Rectification 

From the examples and considerations above, it is next to trivial to conclude that in many
(even though by far not in all) cases the errors committed by a locally operating tagger
can be detected (and sometimes even corrected) by machinery as simple as finite-state
automata. Indeed, very often the more linguistically severe the error is, the easier it is to
detect/correct it - since the severity is measured (of course not formally!) as the degree
of violation of the most simple/basic regularities of language. Without elaborating further
on this, we shall now present two examples of what could be practically achieved.

As for the evaluation of the rectifying power of the strategies proposed below, some
percentage figures of how many of the errors in the output of the TnT tagger could be
discovered/corrected are given below. These figures are, however, only indicative, as they
have been collected manually for the sole purpose of checking whether the development 
of a system which would perform the task would be worthwhile. This is to say, these 
figures have been arrived at in the following manner: 

1. the results of the tagger have been compared to the manually annotated corpus, and
cases of sentences where the tagging results differed in the relevant tag were picked
and stored in a separate file

2. this file has been looked through manually, deleting all cases of discrepancy due to 
an error in the manual annotation and leaving only sentences containing an error in 
the tagging by TnT 

3. the resulting file has been checked manually for cases where the rectifying/error- 
searching algorithm would apply. This search was finished after finding ten such 
cases or at the end of the file, whichever occurred first. The percentage figure has 
then been established as the ratio of the cases where the algorithm would apply to 
the total number of cases checked. 

3.1 Finite Verb at the End of a Subordinated Clause 
Introduced by a Complementizer 

Example: Eltern und Lehrer der Schule beschwerten sich, daß die Gebäude nun doch
erst 1995 erweitert werden(Vinf).

Variant 1:

1. ANYSTRING

2. comma “,” OR “und” OR “oder”

3. complementizer different from “als”

4. STRING containing: NEITHER verb (of any kind - full, modal, aux, in any form),
NOR complementizer, NOR “zu”, “als”, “wie”, NOR quotes (Anführungszeichen)

5. word tagged as infinitive verb

6. STRING containing no verb (of any kind - full, modal, aux, in any form)

ERROR: The tag in 5. is wrong, it should be replaced by the tag finite verb

Variant 2:

1. ANYSTRING

2. comma “,” OR “und” OR “oder”

3. complementizer different from “als”

4. STRING containing: NEITHER verb (of any kind - full, modal, aux, in any form),
NOR complementizer, NOR “zu”, “als”, “wie”, NOR quotes (Anführungszeichen)

5. past participle of a full verb

6. word tagged as infinitive auxiliary verb

7. STRING containing no verb (of any kind - full, modal, aux, in any form)

ERROR: The tag in 6. is wrong, it should be replaced by finite auxiliary verb

Variant 3:

1. ANYSTRING

2. comma “,” OR “und” OR “oder”

3. complementizer with clause different from “als”

4. STRING containing: NEITHER verb (of any kind - full, modal, aux, in any form),
NOR complementizer, NOR “zu”, “als”, “wie”, NOR quotes (Anführungszeichen)

5. infinitive of a full verb

6. word tagged as an infinitive of a full verb or infinitive of a modal verb OR the
form “werden” or “würden” tagged as infinitive of an auxiliary verb, respectively

7. STRING containing no verb (of any kind - full, modal, aux, in any form)

ERROR: The tag in 6. is wrong, it should be replaced by finite full verb or finite modal
verb or finite auxiliary verb, respectively

General Commentary: Fully automatic application possible. The crucial point is that the
verb stands at the very end of a subordinated clause and no verb precedes (within the
clause), that is, it cannot be of the type ... daß Cecilia seiner Freundin die Nilpferde
wird füttern helfen können. As long as the requirement that the subordinated clause is
not followed by any verb is met, the rule is reliable. Should this condition be relaxed
(e.g., to a start of another subordinated clause), counterexamples might occur and human
supervision becomes necessary.

Efficiency: This rule is able to correct 12.73% of the cases where the TnT tagger mistook
a finite form for a non-finite form in the body of the NEGRA corpus.
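
Variant 1 of the rule can be sketched as a single scan over a tagged sentence. The tag labels used here (Vfin, Vinf, Vpp, Comp) are illustrative placeholders rather than the tagset of the NEGRA corpus, and the function name is ours; Variants 2 and 3 would add a check on the token preceding the clause-final verb.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (wordform, tag) pairs

VERB_TAGS = {"Vfin", "Vinf", "Vpp"}      # hypothetical labels for all verb forms
INTRODUCERS = {",", "und", "oder"}       # step 2 of the pattern
BLOCKED_WORDS = {"zu", "als", "wie"}     # step 4 of the pattern
QUOTES = {'"', "“", "”", "„"}


def fix_clause_final_verb(sentence: Sentence) -> Sentence:
    """Variant 1: retag a lone clause-final infinitive after a complementizer."""
    out = list(sentence)
    for i in range(len(out) - 1):
        word, _ = out[i]
        comp_word, comp_tag = out[i + 1]
        if word.lower() not in INTRODUCERS:
            continue
        if comp_tag != "Comp" or comp_word.lower() == "als":
            continue
        tail = out[i + 2:]
        verbs = [k for k, (_, t) in enumerate(tail) if t in VERB_TAGS]
        # Steps 4-6: exactly one verb up to the sentence end, tagged infinitive.
        if len(verbs) != 1 or tail[verbs[0]][1] != "Vinf":
            continue
        middle = tail[:verbs[0]]
        if any(t == "Comp" or w.lower() in BLOCKED_WORDS or w in QUOTES
               for w, t in middle):
            continue
        out[i + 2 + verbs[0]] = (tail[verbs[0]][0], "Vfin")
    return out


tagged = [("Er", "Pron"), ("sagt", "Vfin"), (",", "Punct"), ("daß", "Comp"),
          ("sie", "Pron"), ("morgen", "Adv"), ("kommen", "Vinf"), (".", "Punct")]
print(fix_clause_final_verb(tagged)[6])  # ('kommen', 'Vfin')
```

Requiring exactly one verb between the complementizer and the end of the sentence covers both "no verb before" and "no verb after" the candidate, which is the condition the commentary above names as the source of the rule's reliability.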

3.2 Word Wrongly Tagged as Relative Pronoun 

Example: Die dänischen Stromunternehmen akzeptieren inzwischen ihre Doppelrolle,
auf der(RelPron) einen Seite Strom zu erzeugen und auf der anderen Seite auf seine
möglichst effiziente Verwendung hinzuwirken.

Variant 1:

1. ANYSTRING

2. word tagged as relative pronoun

3. ANYSTRING not containing a word tagged as a finite verb

4. END OF SENTENCE

ERROR: The tag in 2. is wrong OR a verb in 3. (if any) is wrongly tagged

Variant 2:

1. SENTENCE BEGINNING

2. STRING not containing a comma “,”

3. word tagged as relative pronoun

4. ANYSTRING

ERROR: The tag in 3. is wrong

Variant 3:

1. ANYSTRING

2. verb (of any kind - full, modal, auxiliary, in any form)

3. STRING containing NEITHER a comma “,” NOR a coordinating conjunction

4. word tagged as relative pronoun

5. ANYSTRING

ERROR: The tag in 4. is wrong

General Commentary: Application ONLY with human intervention. The procedure is
able to detect an error reliably. However, human intervention is needed for the correction,
since it is not clear what the correct tag should be.

Efficiency: This rule is able to detect 8.54% of the cases where the TnT tagger mistakenly
tagged an article or a deictic pronoun as a relative pronoun in the body of the NEGRA corpus
OR where (in Variant 3) the verb form was tagged incorrectly.
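
The three variants above can be sketched as a simple scan over a tagged sentence. This is only an illustrative sketch, not the implementation used in the experiments: it assumes that a sentence is a list of (word, tag) pairs, that the tags follow the STTS conventions used in NEGRA (PRELS/PRELAT for relative pronouns, VVFIN/VAFIN/VMFIN for finite verbs, KON for coordinating conjunctions, "$," for the comma), and the function name is hypothetical. Imperatives are not treated as finite verbs here.

```python
# Hypothetical sketch: a sentence is a list of (word, tag) pairs with STTS-style tags.
REL_PRON = {"PRELS", "PRELAT"}             # relative pronouns
FINITE_VERB = {"VVFIN", "VAFIN", "VMFIN"}  # finite full, auxiliary and modal verbs
COMMA, COORD_CONJ = "$,", "KON"


def suspicious_relative_pronouns(sentence):
    """Return (token index, variant number) pairs flagged by Variants 1-3."""
    tags = [tag for _, tag in sentence]
    hits = []
    for i, tag in enumerate(tags):
        if tag not in REL_PRON:
            continue
        before, after = tags[:i], tags[i + 1:]
        # Variant 1: no finite verb between the pronoun and the end of the sentence.
        if not any(t in FINITE_VERB for t in after):
            hits.append((i, 1))
        # Variant 2: no comma between the sentence beginning and the pronoun.
        if COMMA not in before:
            hits.append((i, 2))
        # Variant 3: a verb of any kind precedes, and neither a comma nor a
        # coordinating conjunction stands between that verb and the pronoun.
        verbs = [j for j, t in enumerate(before) if t.startswith("V")]
        if verbs:
            gap = before[verbs[-1] + 1:]
            if COMMA not in gap and COORD_CONJ not in gap:
                hits.append((i, 3))
    return hits


# Shortened version of the example sentence; the tags are illustrative.
example = [
    ("Die", "ART"), ("Stromunternehmen", "NN"), ("akzeptieren", "VVFIN"),
    ("ihre", "PPOSAT"), ("Doppelrolle", "NN"), (",", "$,"), ("auf", "APPR"),
    ("der", "PRELS"), ("einen", "ADJA"), ("Seite", "NN"), ("Strom", "NN"),
    ("zu", "PTKZU"), ("erzeugen", "VVINF"), (".", "$."),
]
flagged = suspicious_relative_pronouns(example)
```

In the shortened example only Variant 1 fires for "der", since no finite verb follows it. In line with the commentary above, such a function can only flag candidates; choosing the correct tag is left to a human.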

4 Conclusions 

The main aim of the research described was to prove the fact that the performance 
of statistical taggers can profit from addition of post-processing modules containing 
manually-coded linguistic knowledge (pre-processing modules or stand-alone symbolic 
taggers have been reported several times already). The effect expected is that of surpass- 
ing the threshold of accuracy of the current statistical methods, minimizing simultane- 
ously the time-consuming creation of purely symbolic systems. A non-negligible aspect 
of the work is also that the approach sketched helps to achieve tagging results free of 
linguistically trivial errors, that is, it helps to create corpora which can (hopefully) avoid 
the stigma of being created by sadistic(al) methods. 

Abstract. This paper describes a part of one of the most important syntactic
subsystems present in many inflectional languages - grammatical agreement - 
from the viewpoint of automatic morphological disambiguation of such languages. 
One of the languages on which the main ideas will be demonstrated is Czech 
which - due to its morphological and syntactic complexity - can be regarded 
as a representative of the inflectional subgroup of the Slavic language family. 
It will be shown that notwithstanding the intricacies of the syntax of Czech a 
deeper understanding of the nature of grammatical agreement can result in the 
development of surface syntax rules which can considerably contribute to solving 
the problem of automatic morphological disambiguation of texts stored in Czech 
corpora. Although the only language studied here is Czech, the ideas presented
seem to be applicable, mutatis mutandis, also to the morphological disambiguation
of languages of a similar type, especially the Slavic ones.



1 Introduction 



One of the key issues in the present-day corpus processing of Slavic languages is a correct
(and almost error-free) automatic morphological disambiguation of corpus texts. While
for some Western languages the success rate falls within 97.5-98 % (e.g. English, French
and even Romanian!), for Slavic languages, which are generally characterized by a high
degree of inflection and free word order, correct morphological disambiguation is a
bottleneck for further development of text and corpus processing.
I choose probably the most complex Slavic language - Czech - and I show that the
rich inflection together with strict surface syntax agreement constraints in Czech can
be very helpful for its successful morphological disambiguation. The following
preconditions are absolutely necessary here: (a) a deeper insight into the nature of surface
syntax and morphosyntax of Czech and (b) the abandonment of the methodological naiveté
which has characterized the disambiguation of Czech so far.

First, I shall briefly characterize the state of affairs in the present-day automatic 
morphological disambiguation of Czech and then I shall discuss agreement constraints 
and syntactic rules based on them. 



* The work described is funded by the GACR grant No. 405/96/K214. 

2 Stochastic and Rule-Based Tagging

The only methodology used for the automatic morphological disambiguation of Czech 
corpora (cf. [1]) has been the stochastic one so far (cf. [2], [3], [5]). We have already 
criticized this approach in [4], our main criticisms being: 

- stochastic taggers as applied so far cannot cope with the free word-order in Czech: 
two or more word forms can be syntactically related while their word-order distance 
can be virtually arbitrarily long in a sentence 

- the training data approach used by stochastic taggers is of little help because of the
extremely large number of word-order configurations licensed by the free word order system
of Czech

- for highly inflective Czech thousands of distinct tags are necessary, so stochastic
taggers for Czech must be trained on huge amounts of disambiguated data, which
are not available

- stochastic taggers do not model the system of language. 

In [4] we presented the general framework of a rule-based tagger for Czech based on 
sound linguistic reasoning. This tagger being currently developed is driven by surface 
(morpho)syntactic rules which operate on the result provided by a morphological ana- 
lyzer which outputs (generally) ambiguous word forms. The tagger performs two kinds 
of actions: it 

- selects the correct morphological interpretation(s) (encoded in tag(s)) of a word 
form in a sentence 

- removes the incorrect morphological interpretation(s). 

The overall strategy can be expressed in terms of the horror erroris approach, which
consists in keeping recall identical to or very close to 100 % while gradually
increasing precision, which is initially equal to 0. Moreover, the “positive” action (selecting the
proper tag) and the “negative” one (discarding the incorrect tags) are based on the
unlimited sentential and even broader context.

3 Agreement and Disambiguation 

Surface syntax of Czech (the following basically holds also for other inflectional Slavic 
languages) is based on the following pillars: rich morphology, free word order, valency, 
agreement, clitic placement. While free word order is an “unpleasant” property of Czech 
for the task under investigation, it is compensated, to a considerable extent, by the 
other characteristics given above. I have, in particular, chosen grammatical agreement 
to elucidate the fact that it can be extremely helpful in the process of disambiguation, 
especially for the development of rules which detect the absence of agreement. Out of the
very complex agreement subsystem of the Czech surface syntax I have chosen two main
types of agreement in Czech:

- agreement within the noun group 

- agreement of the subject and predicate verb. 

3.1 Agreement within the Nominal Group

The grave errors stochastic taggers commit when they disambiguate morphological val- 
ues within nominal groups can be avoided by means of rules based on relatively simple 
syntactic insights. I shall deal with the left modifiers of the head noun in nominal endo- 
centric constructions. The following disambiguation rule based on the study of corpora 
of contemporary Czech can be formulated. 

General Agreement Rule for Nominal Groups. If the following conditions hold: 

- a sequence SyntAdjGrp formed only by syntactic adjectives SyntAdj (adjectives
proper, adjectival pronouns, ordinal and generic numerals, cardinal numerals ending
in 1, 2, 3 or 4 but not simultaneously ending in 11-19), adverbs, commas and
coordinating conjunctions (all elements in SyntAdjGrp being possibly ambiguous
only within SyntAdjGrp) immediately precedes a part-of-speech unambiguous
noun N

- all SyntAdj elements in SyntAdjGrp agree with N in gender, number and case in at 
least one morphological interpretation morphint (i.e. each SyntAdj element and N is 
assigned the same triple formed by respective values of gender, number and case) 



every and only such interpretation morphint is correct (there can be more such correct 
interpretations) and all other morphological interpretations assigned to SyntAdj elements 
and to N will be discarded as incorrect. 

The existence of such a triple of values (gendval,numval,caseval) implies that the 
whole construction is an endocentric one with N being the head of the whole nominal 
group. The correct interpretations are thus determined by a kind of unification during 
which those morphological interpretations which do not belong to the intersection of 
interpretations assigned to SyntAdj elements and N will be removed from all SyntAdj 
elements and from N. The rule given above reflects one of the most typical syntactic 
patterns in the Czech corpora - far more than 99 % constructions having the structure 
specified in the conditions are really endocentric with all the syntactic adjectives from 
SyntAdjGrp modifying N. For instance, in the sentence: 

(1) Byly to naše(PossPron) staré(Adj) dobře(Adv) vyzkoušené(Adj) programy(N)

(E. lit. It were our old well tested programs) 

we can, in fact, unify all the syntactic adjectives in gender, number and case, and regard- 
less of the vocative case which can be discarded by other rules two realistic interpretations 
remain: masculine inanimate plural nom/acc after many other morphological readings 
(e.g. singular feminine genitive / dative / locative etc.) have been discarded. 
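
The unification step of the General Agreement Rule can be sketched as a set intersection over (gender, number, case) triples. The sketch below is only an illustration under stated assumptions: the function name and the data layout are hypothetical, the reading sets are deliberately incomplete, and the conditions on the composition of SyntAdjGrp are assumed to have been checked beforehand. Adverbs, commas and conjunctions carry no agreement values and are simply left out.

```python
def unify_nominal_group(synt_adj_readings, noun_readings):
    """Intersect the (gender, number, case) readings of all SyntAdj elements
    with those of the noun N. Returns the set of shared triples, or None when
    no triple is shared, i.e. when the rule does not apply."""
    shared = set(noun_readings)
    for readings in synt_adj_readings:
        shared &= readings
    return shared or None


# Illustrative, incomplete reading sets for sentence (1).
MI_PL = {("masc_inan", "pl", "nom"), ("masc_inan", "pl", "acc")}
nase = MI_PL | {("fem", "sg", "nom"), ("neut", "sg", "nom"), ("fem", "pl", "nom")}
stare = MI_PL | {("fem", "sg", "gen"), ("fem", "sg", "dat"), ("neut", "sg", "nom")}
vyzkousene = MI_PL | {("fem", "sg", "gen"), ("fem", "sg", "dat"), ("fem", "pl", "nom")}
programy = MI_PL | {("masc_inan", "pl", "ins")}

remaining = unify_nominal_group([nase, stare, vyzkousene], programy)
```

With these reading sets only the masculine inanimate plural nominative and accusative triples survive, which corresponds to the two realistic interpretations mentioned above. Every reading outside the returned set would then be removed from all SyntAdj elements and from N.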

A perfect rule-based approach should, however, never be satisfied with such a plain 
intuition as stated above and it should carefully study all the deviations from this over- 
whelmingly most typical structural pattern. If the rule stated above is used the following 
types of errors may occur: 

- reverse dependency: an adjective governs the following material including N 

- ellipsis 

- nominating nominative and metausage. 

Reverse dependency can be accounted for by using a valency dictionary. The rules should
consult such a dictionary for the adjectives which can have valency; the
most “dangerous” among them (in addition to a very small group of frequent nondeverbative
adjectives which have valency, such as plný (genitive) (E. full of) and podobný (dative) (E.
similar to)) are mostly the deverbative ones. If a full-fledged valency dictionary of
adjectives is not available, such “dangerous” adjectives can simply be excluded from the
rule (but then the more constrained context for rule application will result in a smaller
coverage and hence in a lower precision).

The ellipsis problem can hardly be solved for it is extremely difficult to reconstruct 
a missing element. Pragmatically, in written corpora there are only very few instances of 
actual ellipsis (the systemic ellipsis must, however, be fully accounted for by the rules, 
if possible), i.e. a decrease in recall is not dramatic. 

The nominating nominative and metausage are also a serious problem, which is to
be solved by a set of specific rules not discussed here. As a first approximation I
assume that metausage occurs primarily after verbs of saying, but no thorough research
has been performed yet.

Let us now formulate another intuition concerning nominal groups which makes it
possible to define a rule with almost 100 % correctness.

Case Identity Rule. If the following conditions are met: 

- SyntAdjGrp is a sequence of syntactic adjectives (of the type given above) followed 
by a noun N 

- the first element First of SyntAdjGrp and N unambiguously share the same case value
caseval_of_N (either as the immediate result of morphological analysis, or after a
partial disambiguation hitherto performed)

then for First and for all the elements standing between First and N and specified for case
(i.e. not adverbs, for instance): case=caseval_of_N, and moreover:
gend=gendval_of_N, num=numval_of_N.

This rule will - and the corpus evidence does confirm it - substantially contribute
to an almost error-free morphological disambiguation of nominal phrases, especially those
embedded in prepositional phrases beginning with a case-unambiguous preposition
P and ending with a noun N, both P and N having the same unambiguous case value. The
preposition determines the unambiguous case value of the first case-specified element
First following it. The identity of First’s and N’s case values ensures a complete case, gender
and number disambiguation of the intermediate elements. For instance, the following
very frequent type of construction can be disambiguated for case, gender and number
by the given rule:

(2) k(Prep.dat.) dobře(Adv) vychovaným(Adj.dat.pl. / Adj.instr.sg.) a(Conj)
opravdu(Adv) výtečně(Adv) se(PronRefl) učícím(Adj.dat.pl. / Adj.instr.sg.)
starším(Adj.dat.pl. / Adj.instr.sg.) žákům(Noun.dat.pl.masc._anim.)

(E. lit. to well brought up and really brilliantly studying pupils) 

In this not yet disambiguated prepositional phrase the dative preposition k (E. to)
imposes the dative on the adjective vychovaným (E. brought up). As žákům (E. pupils) is in the
same case, all the other (syntactic) adjectives placed in between, i.e. učícím (E. studying)
and starším (E. older), will also be in the dative case, their gender (masc._anim.) and
number (pl.) values being identical with those of the noun žákům.
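
A minimal sketch of the Case Identity Rule, applied to example (2), may make the procedure more concrete. It rests on several assumptions that are not part of the rule as formulated above: the function name and the data layout are hypothetical, N is assumed to be already fully unambiguous, a case-unambiguous preposition is modelled by the optional argument prep_case, and the rule is abandoned (as a safeguard) whenever an intermediate element lacks the agreeing reading.

```python
def apply_case_identity(elements, noun, prep_case=None):
    """elements: readings of First ... the last element before N, in order;
    each entry is a set of (gender, number, case) triples, or None for an
    element not specified for case (adverb, conjunction, ...).
    noun: the readings of N. Returns the disambiguated list, or None when
    the rule does not apply."""
    cased = [r for r in elements if r is not None]
    if not cased or len(noun) != 1:
        return None            # assumption: N must already be unambiguous
    target = next(iter(noun))
    first = cased[0]
    if prep_case is not None:  # a case-unambiguous preposition filters First
        first = {r for r in first if r[2] == prep_case}
    if {r[2] for r in first} != {target[2]}:
        return None            # First and N do not unambiguously share a case
    result = []
    for readings in elements:
        if readings is None:
            result.append(None)
        elif target in readings:
            result.append({target})
        else:
            return None        # safeguard (assumption): agreeing reading missing
    return result


DAT_PL = ("masc_anim", "pl", "dat")
INS_SG = ("masc_anim", "sg", "ins")
adj = {DAT_PL, INS_SG}
# dobře vychovaným a opravdu výtečně se učícím starším | žákům
elements = [None, adj, None, None, None, None, adj, adj]
resolved = apply_case_identity(elements, {DAT_PL}, prep_case="dat")
```

Without the preposition (prep_case=None) the first adjective remains ambiguous between dative and instrumental, so the sketch returns None and leaves the phrase to other rules.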

3.2 Agreement of the Subject and Predicate Verb 

In Czech, if the subject is formed by a noun in the nominative case, the predicate verb 
agrees with it in gender, number and person. The obligatory agreement in these three 
categories is extremely helpful for the correct morphological disambiguation of Czech. 
I can demonstrate this by a classical sentence: 

(3) Soubor(noun,masc.inan.sg.) se nepodařilo(verb,neut.sg.) otevřít.

(E. lit. The file it failed to open.) 

(E. The file failed to open.) 

The masculine inanimate noun soubor (E. file), which has - as a word form as such -
the nominative and accusative interpretations, does not agree in gender with the predicate
verb nepodařilo (E. failed), which is marked for the neuter gender. Therefore soubor
cannot be the subject of the sentence, i.e. its nominative interpretation should be discarded
as incorrect (indeed, soubor is the object of the transitive verb otevřít). In the
context of sentence (3) this reasoning is sound, but in Czech the function of the nominative
case is not limited to the subject. If we want to eliminate the nominative reading in
non-agreement constructions similar to the one above, we must - if we do not want to commit naive
disambiguation errors - very carefully specify the “dangerous” syntactic phenomena
which could lead to erroneous disambiguation and constrain the context of the rule
application accordingly. Therefore, the following relatively sophisticated limiting conditions
must be imposed on the context so that the nominative interpretation of a would-be
subject can be eliminated with utmost certainty:

Nominative Discard Rule. Let us formulate the following constraints:

- let the comparative non-subject constructions introduced by the adverb or conjunction
jako/jak, coby (E. as, like), než, nežli (E. than) not appear in front of a noun
N which is the candidate for a subject interpretation.

Such comparative constructions must be excluded because of the constructions of 
the type: 

(4) . . . vzápětí ji jako blesk(Noun,masc.inan.) osvítilo(VFin,neut.) poznání
(E. lit. . . . immediately her like a lightning illuminated the realization . . . E. . . . immediately
she was illuminated by a realization as if struck by a lightning)

Here blesk (E. lightning) does not agree with the finite verb osvítilo in gender, but
the only correct nominative reading of blesk must not be discarded because blesk is
part of the comparative construction jako blesk and not a subject of (4);

- let the predicate verb not belong to the copulative and double-nominative verbs
such as být (E. be), zdát se (E. seem), připadat (si) (E. seem), jevit se (E. seem),
nazývat (si) (E. call / be called) (this very small group must be fully specified). This
limitation is motivated by the fact that in Czech the copula and double-nominative
verbs can agree with the nominal predicate rather than with the subject (in this
subarea, there exist complex rules in Czech):

(5) Ten chlapeček bylo ještě dítě. (E. lit. The little boy was still a child.)

where the subject chlapeček (E. little boy) is masculine animate and the copula bylo
(E. was) and the noun dítě (E. child) are neuter. Discarding the nominative
reading of chlapeček would be incorrect;

- N should not be the nominating nominative, because such nominative nouns as
Anselm in the following type of constructions:

(6) . . . jméno Anselm označovalo člověka, který . . .

(E. . . . the name of Anselm denoted a man who . . . )

are absolutely correct: the subject in sentence (6) is formed by the neuter
noun jméno (E. name), which duly agrees with the neuter finite verb označovalo (E.
denoted). Thus, the correct nominative reading of Anselm cannot be discarded. On
the whole, the nominating nominative is one of the main problems of the rule-based
disambiguation of Czech. A noun N2 in the nominative case can modify a
preceding noun N1 which can have an arbitrary case value (N2 can even be
formed by a complete nominal group!). The intricate problem consists in recognizing
such a construction and in determining the true subject in such clauses;

- N should not be extracted from the dependent clause as in sentence: 

(7a) Soubor bylo nutné, aby byl otevřen.

(E. lit. The file it was necessary so that it were opened.) 

(E. The file was necessary to open.) 

where soubor is in the nominative case because soubor can be considered as ex- 
tracted from the dependent clause as its nominative subject (of a synonymous un- 
derlying structure) and raised to the main clause. Such an underlying structure could 
have the following shape: 

(7b) Bylo nutné, aby byl soubor otevřen.

(E. lit. It was necessary so that the file was opened) 

(E. It was necessary for the file to be opened) 

Here soubor is the nominative subject and its raising has not changed its case value. 
Thus, the removal of its nominative reading would be erroneous. 

Such sentences comprising extracted elements are difficult to recognize by shallow 
rules but the main clause verbs which license such extraction form only a relatively 
small set. Such verbs must be explicitly excluded in the condition, i.e. if such a verb 
is encountered in the sentence being processed, the rule does not apply. 

The conditions given above form the first set of necessary conditions for discarding 
the nominative reading of the noun in the position of a would-be subject N. Now I 
shall specify further conditions for the Nominative Discard Rule to apply which concern 
the values of gender (due to the lack of space the other key categories, viz. number 
and person, are not discussed) assigned to the noun N and to the predicate verb V. My 
objective is to specify such gender, number and person relations of N and V which would 
ensure the removal of the nominative reading of N. 

I shall limit my considerations (for reasons of easier exposition) to N and V standing
next to each other (both mutual word-order positions are possible). Here N can be in
the nominative case and the predicate verb V is a past participle marked for gender. Table
1 below shows the relevant gender value relations of N and V. For gender it is important to
take into account the hierarchy of gender values, which is relevant for the value of the
predicate verb in case the subject is formed by coordination. The gender value hierarchy
in Czech looks as follows:

1. masculine animate, 2. masculine inanimate, 3. feminine, 4. neuter.

Now, in addition to the conditions specified above, the last condition of the entire 
Nominative Discard Rule and the resulting discard action will be specified: 
if N and V satisfy gender value relations stated in: 

Table 1.

Noun N            Verb V
masc.animate      masc.inanimate and/or fem and/or neut
masc.inanimate    fem and/or neut
fem               neut

where the conjunction and/or expresses the fact that the Verb’s gender value need not 
be unambiguously disambiguated (e.g. it can still have both masculine inanimate and 
feminine interpretation assigned): 

N is not a nominative subject in the given noun-verb relation, i.e. the case=nominative
interpretation of N can be discarded.

For instance, the gender value combination implying that soubor in sentence (3)
cannot be in the nominative case is specified in the 2nd row of Table 1.

Let us add that if in a sentence there is a gender value relation of N and V which is 
not specified in Table 1 then other rule(s) - not specified here - must be applied. 
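
The gender condition of the Nominative Discard Rule can be sketched as a table lookup. The sketch is hypothetical in its names and data layout, it assumes that the gender of N is already unambiguous, and it covers only the gender relation of Table 1: the contextual constraints formulated above (comparative constructions, copulative verbs, nominating nominative, extraction) as well as number and person are not modelled.

```python
# Table 1: for each gender of N, the verb genders under which the nominative
# reading of N can be discarded.
DISCARD_TABLE = {
    "masc_anim": {"masc_inan", "fem", "neut"},
    "masc_inan": {"fem", "neut"},
    "fem": {"neut"},
}


def can_discard_nominative(noun_gender, verb_genders):
    """True if every remaining gender reading of the verb V is listed in the
    Table 1 row for the gender of N (the "and/or" of the table)."""
    allowed = DISCARD_TABLE.get(noun_gender)
    if allowed is None or not verb_genders:
        return False           # combination not covered by Table 1
    return set(verb_genders) <= allowed


# soubor (masc. inanimate) next to nepodařilo (neuter): second row of Table 1.
soubor_discard = can_discard_nominative("masc_inan", {"neut"})
```

The subset test reflects the remark that the gender of V need not be fully disambiguated: the verb may still carry several gender readings, as long as all of them appear in the relevant row.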

4 Conclusion 

As we have seen, relatively simple rules using agreement and non-agreement patterns
can be used for the automatic morphological disambiguation of elements involved in
agreement relations in Czech sentences. These rules substantially contribute to the solution
of some case disambiguation problems in Czech and can, mutatis mutandis, be
applied to inflectional languages whose declension system is characterized by a similarly
high degree of case syncretism.

Abstract. Allowing users to interact through language borders is an interesting
challenge for information technology. For the purpose of a computer assisted 
language learning system, we have chosen icons for representing meaning on the 
input interface, since icons do not depend on a particular language. However, 
a key limitation of this type of communication is the expression of articulated 
ideas instead of isolated concepts. We propose a method to interpret sequences of 
icons as complex messages by reconstructing the relations between concepts, so 
as to build conceptual graphs able to represent meaning and to be used for natural 
language sentence generation. This method is based on an electronic dictionary 
containing semantic information. 



1 Introduction 

There are some contexts in the field of information technology where the available data 
is limited to a set of conceptual symbols with no relations among them. In applications 
we have developed, icons are used on the input interface to represent linguistic concepts 
for people with speech disabilities, or for foreign learners of a second language; in 
information extraction or indexing applications, sets of keywords may be given with no 
higher-level structure whatsoever; the same situation may occur in a context of cross- 
linguistic communication where participants in an online discussion forum are able to 
exchange bare concepts through automatic search in electronic dictionaries, but are not 
able to master the syntactical structure of each other’s language. 

The problem in such contexts is that there is no deterministic way to compute the
semantic relations between concepts, while the meaning of a structured message resides
precisely in the network built from these relations. Isolated concepts thus lack the
expressive power to convey ideas; until now, the expression of abstract relations between
concepts has not been achievable without the use of linguistic communication.

We have proposed an approach to tackle this limitation [8]: a method which interprets
sequences of isolated concepts by modelling the use of “natural” semantic knowledge.
This makes it possible to build knowledge networks from icons, as is usually done
from text. A first application, developed for a major electronics firm, aimed at
providing speech-impaired people with iconic aided communication software. We are
now working on improving the theory in order to implement it in the field of computer
assisted language learning. Here we present new formalisms to model lexical meaning
and associative semantic processes, including representation of conceptual inheritance,
which have been developed for the latter application.

2 Description of the Problem 

Assigning a signification to a sequence of information items implies building conceptual
relations between them. Human linguistic competence consists in manipulating these
dependency relations: when we say that “the cat drinks the milk”, for example, we
perceive that there are well-defined conceptual connections between ‘cat’, ‘drink’, and
‘milk’, namely that ‘cat’ and ‘milk’ play given roles in a given process. Linguistic theories
have been developed specifically to account for these phenomena [7,4], and several
symbolic formalisms in AI [5,6] reflect the same approach. Computationally speaking,
‘cat’, ‘drink’ and ‘milk’ are: without relations, a set of keywords; with relations, a
structured information pattern. This has important consequences e.g. in text filtering and
information retrieval.

Human natural language reflects these conceptual relations in its messages through 
a series of linguistic clues. These clues, depending on the particular languages, can 
consist mainly in word ordering in sentence patterns (“syntactical” clues, e.g. in English, 
Chinese, or Creole), in word inflection or suffixation (“morphological” clues, e.g. in 
Russian, Turkish, or Latin), or in a given blend of both (e.g. in German). Parsers are
systems designed to analyze natural language input on the basis of such clues, and to
yield a representation of its informational content.

In the context of language learning, where icons have to be used to convey complex 
meanings, the problem is that morphological clues are of course not available, when at 
the same time we cannot rely on a precise sentence pattern (there is no “universal icon 
grammar”, and if we were addressing perfectly functional speakers of a given language, 
with its precise set of grammar rules, we wouldn’t be using icons). 

Practically, this means that, if we want to use icons as an input for computer communication,
we cannot rely on a parser based on phrase structure grammar (“CFG”-style)
to build the conceptual relations of the intended message. We have to use a parser
based on dependency computing, such as those which have been written to cope with
variable-word-order languages [1]. However, since no morphological clue is available
either to tell that an icon is accusative or dative, we have to rely on semantic knowledge
to guide role assignment. In other words, an icon parser has to know that drinking is
something generally done by living beings and involving liquid objects.

3 Modelling Meaning 

The first step is then to encode the semantic information representing this type of natural
world knowledge. For this purpose, we develop an icon lexicon where the possible
semantic relations are specified by feature structures among which unification can take
place. However, the feature structures do not have a syntactic meaning here, as e.g.
in HPSG, but a natural language semantics meaning: instead of formal grammatical
features, it is specified which “natural properties” the different icons should have, and
how they can combine with the others.



3.1 Intrinsic vs. Extrinsic Features 

Every icon in the lexicon has a certain number of intrinsic attributes, defining its fundamental
meaning elements. Going back to our example, ‘cat’ has the features animal,
living, while ‘milk’ has the features liquid, food.

In natural language semantics, some pairs of concepts are defined in opposition to
each other; for the sake of modelling simplicity, we define these pairs as couples of
features sharing the same attribute but with opposite values. This modelling choice
leads us to define the basic feature, or intrinsic feature, as a pair (a, v), where the attribute
a is a symbol, and the value v is +1 or -1.

Yet intrinsic features are not enough to build up relations: we need at least some 
first-order semantics to allow predication. Hence a restricted set of icons, the predicative 
icons (roughly corresponding to natural language verbs and adjectives), also have sets 
of extrinsic (or selectional) features, that determine which other concepts they may 
incorporate as actants. These extrinsic features specify for example which properties are 
“expected” from the agent or the object of an action, or to which categories of concepts 
a particular adjective may be attributed: in our example, ‘drink’ would have the features 
agent(animal) and object(beverage). 

This could lead us to define the extrinsic feature as a pair (c, ef), where the case c is
a symbol, and the expected feature ef is an intrinsic feature as defined above, i.e. as
being of the form (c, (a, v)), where case c and attribute a are symbols, and value v is +1
or -1.

However, with such a definition, the selectional effect of an extrinsic feature can
only be compelling (the attribute is present with a value of +1), blocking (the attribute is
present with a value of -1), or null (the attribute is absent). Yet natural semantics involves
the ability to represent gradation: in natural language for instance, a given association
between words may be expected, but it does not completely block the possibility that
another one be realized.

So, we decide to define the extrinsic feature as (c, (a, v)), where c (the case) is
a symbol, a (the attribute) is a symbol, and $v \in \mathbb{R}$. This way of modelling allows the
value v to be tuned in order to make a semantic association more or less compelling.

The extrinsic features contain all the information about the potential case relations 
that may occur in the icon language. Considering a given predicative icon, its valency 
frame, or case frame, is strictly equivalent to the set of its extrinsic features factorized 
by case. Considering the whole lexicon, the case system is defined by the set of all cases 
appearing in any extrinsic feature of any icon. 
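
The definitions of this section can be illustrated by a small data-structure sketch. Everything in it is an assumption made for illustration: the class and function names are hypothetical, the numeric values attached to ‘drink’ are invented, and the actual lexicon format is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Icon:
    name: str
    # intrinsic features: attribute -> value (+1 or -1)
    intrinsic: dict = field(default_factory=dict)
    # extrinsic features: list of (case, (attribute, value)), value a real number
    extrinsic: list = field(default_factory=list)


def case_frame(icon):
    """Extrinsic features of a predicative icon, factorized by case."""
    frame = defaultdict(dict)
    for case, (attribute, value) in icon.extrinsic:
        frame[case][attribute] = value
    return dict(frame)


def case_system(lexicon):
    """All cases appearing in any extrinsic feature of any icon."""
    return {case for icon in lexicon for case, _ in icon.extrinsic}


cat = Icon("cat", intrinsic={"animal": 1, "living": 1})
milk = Icon("milk", intrinsic={"liquid": 1, "food": 1})
drink = Icon("drink", extrinsic=[
    ("agent", ("animal", 1.0)),
    ("object", ("beverage", 1.0)),
    ("object", ("liquid", 0.5)),   # a weaker expectation (invented value)
])
```

Here case_frame(drink) groups the expectations of ‘drink’ under agent and object, and case_system([cat, milk, drink]) yields exactly these two cases; non-predicative icons such as ‘cat’ have an empty case frame.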



3.2 Feature Inheritance 

There are obvious advantages of including a representation of inheritance in the lexicon, 
such as: saving representation space (‘dog’, ‘cat’, and ‘hamster’ only have a few specific 
features represented separately, the rest is stored under ‘pet’); providing a measure of 
semantic distance between concepts (how many “common ancestors” do they have, and
at which level?) 

However, since natural concepts may be grouped in overlapping categories, there can 
be no unique tree-like hierarchy covering the whole lexicon. For this reason, a mechanism 
of multiple inheritance has been developed. 

The multiple inheritance model allows a single concept to inherit thematic features
from a thematic group, as well as structural features from an abstract superconcept
spanning different subgroups of the thematic hierarchy (for instance the superconcept
‘action’, which passes the extrinsic feature agent(animal) on to all specific concepts
which inherit from it). Intrinsic features as well as extrinsic features may be inherited,
and passed on to more specific subconcepts.

The well-known theoretical problem of multiple inheritance, namely the possibility 
that a concept inherit contradictory features from two separate branches of the inheritance 
graph, is not an actual problem in the context of a model for natural meaning. In fact, 
natural categories are not logical categories, and it is actually normal that contradictions 
may arise. If they do, they are meaningful and should not be “solved”. Specifically, in the 
analysis application described below, feature values are added, so if an attribute appears 
once with a positive value, and once with a negative one, it counts as if its value were 
zero. 
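
The additive treatment of inherited features can be sketched as a traversal of the inheritance graph. This is a sketch under assumptions, not the implemented system: the function name, the two dictionaries and the toy concepts are hypothetical, and counting each ancestor only once (even when it is reachable along several paths) is our own simplification.

```python
from collections import defaultdict


def inherited_features(concept, parents, own):
    """Sum the intrinsic feature values of a concept and of all its ancestors.
    parents: concept -> list of direct superconcepts
    own:     concept -> {attribute: value} stored directly on the concept"""
    totals = defaultdict(float)
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue           # assumption: each ancestor contributes once
        seen.add(current)
        for attribute, value in own.get(current, {}).items():
            totals[attribute] += value
        stack.extend(parents.get(current, []))
    return dict(totals)


# Toy lexicon with a contradiction between two branches (invented example).
own = {
    "penguin": {"living": 1},
    "bird": {"flies": 1},
    "swimmer": {"flies": -1, "swims": 1},
}
parents = {"penguin": ["bird", "swimmer"]}
features = inherited_features("penguin", parents, own)
```

In the toy lexicon the attribute flies is inherited once with +1 and once with -1, so its summed value is zero, while living and swims keep the value 1.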

It is important to note that concept labels may be used as attributes in semantic 
features, like when we want to specify that the object of ‘drink’ has to be a ‘beverage’. 
This means that we do not postulate any ontological difference between a feature and 
a concept. As a matter of fact, studies in natural language semantics, for instance, always 
represent features by using words: features simply are more primitive concepts than the 
concepts studied. So, when we say that concept a inherits from concept b (is a subconcept 
of concept b), we mean exactly the same thing as when we say that concept a has the 
feature b, and there is a unique formal representation for all this, like in the example 
below: 





[Figure: example feature structure of a concept, listing its intrinsic features and its extrinsic features (none in this example)] 



An important practical consequence of this is that we can talk of feature inheritance: 
this will be used in the analysis process. 



4 The Semantic Analysis Method 

The icon parser we propose performs semantic analysis of input sequences of icons by 
means of an algorithm based on best-unification: when an icon in the input sequence has 
a “predicative” structure (it may become the head of at least one dependency relation to 
another node, labeled “actant”), the other icons around it are checked for compatibility. 
Semantic compatibility is then measured as a unification score between two sets of 
feature structures: the intrinsic semantic features of the candidate actant: 

$$\mathcal{IF} = \{(a_{11}, v_{11}), (a_{12}, v_{12}), \ldots, (a_{1m}, v_{1m})\},$$

and the extrinsic semantic features of the predicative icon attached to the semantic 
role considered, the case $c$: 

$$\mathcal{SF} = \{(a_{21}, v_{21}), (a_{22}, v_{22}), \ldots, (a_{2n}, v_{2n})\},$$

(where $(c, (a_{21}, v_{21})), (c, (a_{22}, v_{22})), \ldots, (c, (a_{2n}, v_{2n}))$ are extrinsic features of the 
predicate). 

The basic idea is to define compatibility as the sum of matchings between the two sets 
of attribute-value pairs, divided by the number of features being compared against. Note 
that semantic compatibility is not a symmetric norm: it has to measure how well the 
candidate actant (i.e. the set $\mathcal{IF}$) fits the expectations of a given predicative concept 
with respect to its case $c$ (i.e. the set $\mathcal{SF}$). Hence there is a filtering set ($\mathcal{SF}$) and a filtered 
set ($\mathcal{IF}$). The asymmetry shows in the following definition of the compatibility 
function, in that the denominator is the cardinality of $\mathcal{SF}$, not of $\mathcal{IF}$: 

$$C(\mathcal{IF}, \mathcal{SF}) = C(\{(a_{11}, v_{11}), \ldots, (a_{1m}, v_{1m})\}, \{(a_{21}, v_{21}), \ldots, (a_{2n}, v_{2n})\}) = \frac{\sum_{j \in [1,n]} \sum_{i \in [1,m]} f((a_{1i}, v_{1i}), (a_{2j}, v_{2j}))}{n}$$

where $f$ is a matching function defined on pairs of individual features, not on pairs of 
sets of features. 

Now the matching function $f$ has to be defined at the level of the features 
themselves so as to take into account the inheritance phenomena. So we define 

$$f((a_1, v_1), (a_2, v_2))$$

(where $(a_1, v_1)$ is the intrinsic [filtered] feature, and $(a_2, v_2)$ the extrinsic [filtering] 
feature), as follows: 

1 - if the two attributes are the same ($a_1 = a_2 = a$): 

    $f((a, v_1), (a, v_2)) = v_1 \cdot v_2$ ; 

2 - if $a_1 \Rightarrow a_2$ ($a_1$ includes $a_2$ in its signification, i.e. $a_1$ is a subtype of $a_2$): 

    - if $v_1 < 0$, $f((a_1, v_1), (a_2, v_2)) = 0$, 

    - if $v_1 > 0$, $f((a_1, v_1), (a_2, v_2)) = v_1 \cdot v_2$ ; 

3 - if $a_1 \Rightarrow \neg a_2$ ($a_1$ includes a feature $a'_2$ in its signification, such that $a'_2$ is contradictory 
    with $a_2$): 

    - if $v_1 < 0$, $f((a_1, v_1), (a_2, v_2)) = 0$, 

    - if $v_1 > 0$, $f((a_1, v_1), (a_2, v_2)) = -v_1 \cdot v_2$ ; 

4 - if $a_1 \neq a_2$, and $a_1 \not\Rightarrow a_2$, and $a_1 \not\Rightarrow \neg a_2$, then: 

    - either $a_2$ is a primitive feature ($\nexists x \mid a_2 \Rightarrow x$), in which case: 

      $f((a_1, v_1), (a_2, v_2)) = 0$, 

    - or $a_2$ is decomposable into more primitive features; then, 
      letting $\{a_{21}, a_{22}, \ldots, a_{2k}\}$ be the set of features implied by $a_2$ 
      ($a_2 \Rightarrow a_{2j}$ for $j \in [1,k]$), 

      $f((a_1, v_1), (a_2, v_2)) = C(\{(a_1, v_1)\}, \{(a_{21}, v_2), (a_{22}, v_2), \ldots, (a_{2k}, v_2), (dummy\_symbol, v_2)\})$ . 

Let us explain and illustrate this definition by simple examples. Suppose we want 
to test whether some icon possessing the feature dog ($(dog, 1)$) is a good candidate 
for being the agent of the verb ‘bark’; ‘bark’ having an extrinsic feature agent(dog) 
($(agent, (dog, 1))$). We will then be trying to evaluate $f((dog, 1), (dog, 1))$. This 
is case 1, and the result will be 1. If we had tried to match this same icon to a verb 
whose agent should not be a dog ($(agent, (dog, -1))$), the result would of course 
have been $-1$. 

Now suppose we want to match dog to a verb which only expects its agent to be 
an animal. We will have to evaluate $f((dog, 1), (animal, 1))$. Since dog is a subtype of 
animal, we have $dog \Rightarrow animal$, so we are in case 2, and the result is 1 (a dog 
entirely fulfills the expectation of being an animal). 

If on the other hand we wanted to match some concept which we know is not 
a dog, because it has the feature $(dog, -1)$, to the semantic role where an animal is 
expected, we could obviously draw no conclusion from the mere fact that it is not a dog. 
Not being a dog does not imply not being an animal. This is why in this particular subcase 
of case 2, the result is 0. 

Now if we want to match dog to some semantic role where an object is expected, 
we find that $dog \Rightarrow living\_being$; since object and living_being are mutually 
exclusive, we are in case 3 and find the value $-1$. 

As in case 2, there is a subcase of case 3 where the result is 0 because no conclusion 
can be drawn (e.g. we cannot deduce from something not being a dog that it is a 
non-living object). 

Finally let us suppose that we want to match some animal which is not a dog to the 
agent role of ‘bark’, which expects dog . The candidate concept does not possess the 
feature (dog, 1) but it possesses the feature (animal, 1) . It would be inappropriate, in 
this case, that this concept should have no better score than any other: being an animal, 
it is semantically “closer” to dog than an inanimate object, for example, would be (this 
is what allows, in natural language semantics, sentences like “the police superintendent 
barks” [2]). 

This is why, in this case, we break up dog into more primitive components and 
recursively call the function C (compatibility on sets of features), so that (animal, 1) 
will eventually meet (animal, 1) , and will yield a positive, though fractional, result. 

A dummy feature is added so that the compatibility value loses a small proportion 
of itself in this operation of breaking up, by incrementing the denominator. 

Note that the recursion ($C$ is based on $f$, and $f$ is partially based on $C$) is not 
infinite, since the decomposition always falls back on primitive features: there is no 
infinite loop. This is guaranteed, not by the definition of the functions themselves, but 
by the fact that the inheritance graph is a directed acyclic graph. 
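
The four cases of the matching function and the recursive call to C can be sketched as follows. This is only an illustration: the toy inheritance graph, the table of exclusive attributes, and the choice of decomposing a feature into the features it directly implies are assumptions, not the author's implementation.

```python
# Hypothetical inheritance graph: attribute -> attributes it directly implies.
IMPLIES = {
    "dog": ["animal"],
    "cat": ["animal"],
    "animal": ["living_being"],
    "living_being": [],
    "object": [],
}
# Hypothetical pairs of mutually exclusive attributes.
EXCLUSIVE = {frozenset(("living_being", "object"))}

def ancestors(a):
    """All attributes implied by `a`, directly or transitively."""
    found, stack = set(), list(IMPLIES.get(a, []))
    while stack:
        x = stack.pop()
        if x not in found:
            found.add(x)
            stack.extend(IMPLIES.get(x, []))
    return found

def contradicts(a1, a2):
    """True if a1, or a feature implied by a1, is exclusive with a2."""
    return any(frozenset((x, a2)) in EXCLUSIVE for x in ancestors(a1) | {a1})

def f(intrinsic, extrinsic):
    """Match one intrinsic (filtered) feature against one extrinsic (filtering) one."""
    (a1, v1), (a2, v2) = intrinsic, extrinsic
    if a1 == a2:                         # case 1: same attribute
        return v1 * v2
    if a2 in ancestors(a1):              # case 2: a1 is a subtype of a2
        return v1 * v2 if v1 > 0 else 0
    if contradicts(a1, a2):              # case 3: a1 implies something exclusive with a2
        return -v1 * v2 if v1 > 0 else 0
    parts = IMPLIES.get(a2, [])          # case 4: unrelated attributes
    if not parts:                        # a2 is primitive
        return 0
    # a2 is decomposable: recurse on its components plus a dummy feature,
    # which lowers the score by enlarging the denominator.
    filtering = [(p, v2) for p in parts] + [("dummy_symbol", v2)]
    return C([intrinsic], filtering)

def C(intrinsic_set, extrinsic_set):
    """Sum of all pairwise matchings, divided by the size of the filtering set."""
    total = sum(f(i, e) for e in extrinsic_set for i in intrinsic_set)
    return total / len(extrinsic_set)
```

With this toy graph, matching (dog, 1) against an expected (animal, 1) gives 1, matching it against (object, 1) gives -1, and matching (animal, 1) against an expected (dog, 1) gives the fractional value 0.5, because the dummy feature doubles the denominator.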

Globally, for every predicate in the actual input sequence, the analysis process seeks 
to assign the best actant for every possible role of the predicate’s immanent conceptual 
structure. The absolute compatibility between the predicate and the actant, defined in 
the sense of the function C described above, is weighted by a function valued between 
0 and 1 and decreasing with the actual distance between the two icons in the sequence. 

The result yielded by the semantic parser is the graph that maximizes the sum of the 
compatibilities of all its dependency relations. It constitutes, with no particular contextual 
expectations, and given the state of world knowledge stored in the iconic database in the 
form of semantic features, the “best” interpretation of the user’s input. 



5 Application and Evaluation 

A primitive version of the semantic analysis algorithm was implemented in 1996 
for rehabilitation purposes, within a French electronics firm (Thomson-CSF), as part 
of a software communication tool for speech-impaired people [8]. The evaluation 
showed acceptable analysis accuracy (80.5 % of the sequences correctly 
analyzed on a benchmark of 200 samples). However, user acceptance 
remained low because the recursive algorithm was very time-consuming (complexity 
and running time grew as $O(n \cdot e^n)$ in the size $n$ of the input). 

An application to the field of CALL (Computer Assisted Language Learning) is 
currently being developed at the Humboldt University of Berlin. The application prototype 
aims at allowing learners of German as a second language to practice communication 
in that language at home or in tutorial classes. The users first tell the computer what 
they intend to express by pointing to icons. The system interprets these icons 
semantically, and proposes a choice of rated formulations (1) in the form of conceptual graphs, 
and (2) as full German sentences. The users are then allowed to “play” with the graph 
to discover how to express variations or refinements, in particular concerning nuances 
in verbs as expressed in Kunze’s theory of verb fields [3]. This application is made 
possible by mapping the results of the semantic analysis into a lexical database of the 
German language developed by the Chair for Computational Linguistics at the Humboldt 
University (Figure 1). 

The implementation principles have been renewed in this application, so as to 
develop a form of parser that stores its intermediate results (inspired by “chart parsers” for 
context-free grammars). This requires considerably less backtracking, and hence brings a big gain in 
computational complexity (now measured in O(n^…)), and removes one of the major 
impediments of the method. 

Fig. 1. Structure of the CALL system 



Abstract. Learning Bayesian Belief Networks from corpora has been applied 
to the automatic acquisition of verb subcategorization frames for Modern Greek 
(MG). We incorporate minimal linguistic resources, i.e. morphological 
tagging and phrase chunking, since a general-purpose syntactic parser for MG is 
currently unavailable. Experimental results have been compared against 
Naive Bayes classification, which is based on the conditional independence 
assumption, along with two widely used methods, Log-Likelihood Ratio (LLR) and 
Relative Frequencies Threshold (RFT). We have experimented with a balanced corpus 
in order to ensure unbiased behavior of the training model. Results show 
that obtaining the inferential dependencies of the training data could lead to a 
precision improvement of about 4% compared to that of Naive Bayes and 7% compared 
to LLR and RFT. Moreover, we have been able to achieve a precision exceeding 
87% on the identification of subcategorization frames which are not known 
beforehand, while limited training data prove sufficient to yield satisfactory results. 



1 Introduction 

The detection of the set of syntactic frames, i.e. the syntactic entities a certain verb 
subcategorizes for, is important especially for tasks like parsing and grammar development. 
It provides the parser with syntactic information on a verb’s arguments, i.e. the set of 
restrictions the verb imposes on the syntactic constituents which are required for the 
meaning of the verb to be fully expressed. Machine-readable dictionaries listing 
subcategorization frames (SF) usually give only expected frames rather than actual ones 
and are therefore incomplete, or are not available for some languages, including Modern 
Greek. Acquiring frames automatically from corpora overcomes these problems 
altogether. 

Previous work on learning SF automatically from corpora focuses mainly on English 
[2], [12], [3], [7]. [1] deal with Italian, [6] with German, [12] with Japanese, [14] with 
Czech. In most of the above approaches, the input corpus is fully parsed and in some of 
them only a small number of frames are learned [2], [12]. 

For the present work, we apply Bayesian Belief Network learning from corpora and 
use the extracted network as an inference tool that enables automatic acquisition of SF 
for Modern Greek. The complete set of frames for a particular verb is not known 
beforehand but is learned automatically through the training process. As a starting point 
for parameter selection, we chose the parameters required by the Log-Likelihood Ratio 
(LLR) statistic [5]. We evaluate the experimental results against the Naive Bayes classifier 
and study the influence of the conditional independence assumption on the overall 
precision. Furthermore, two common methods that were used in previous research were also 
applied to provide a more complete set of conclusions. Bayesian inference outperforms 
all the other techniques, while, by introducing three supplementary parameters, precision 
increases considerably. 10-fold cross validation is used to obtain the results. 

MG is a ’free word order’ language. Position of the syntactic constituents in a sen- 
tence is a very weak indicator of the syntactic role of the constituent. Moreover, the 
existence of adverbs in the neighborhood of a verb is a major source of noise for the task 
at hand, as is indicated by our experimental results. 

The corpus used for our study is the balanced ILSP/ELEFTHEROTYPIA Greek 
Corpus¹ (consisting of 1.6 million morphologically tagged words of political, social and 
sports content, taken from a wide-circulation newspaper). 

This paper is organized as follows: Initially, we present certain linguistic properties 
that are relevant to the task of SF acquisition, which characterize MG and need to be taken 
into consideration (section 2). The corpus preprocessing stage as well as the detection of 
the environments of a verb are covered next in the same section. The Bayesian techniques 
applied to the frame detection task are presented one by one along with the feature sets 
selected for them (sections 3 and 4). The results obtained by these methods are tabulated 
in section 5, with a summary of concluding remarks in the last section. 



2 Corpus Preprocessing 

MG is a ‘free word-order’ language. The arguments of a verb do not have fixed positions 
within the sentence and they are basically determined by their morphology. Noun phrases 
(NP), prepositional phrases (PP), adverbs, secondary clauses and combinations of these 
constituents may function as arguments to verbs. Weak personal pronouns (WPP) may 
also function as arguments to verbs when occurring within the verb phrase either in 
the genitive or accusative case. Given that a wide-coverage syntactic parser for Modern 
Greek is not yet available, limited linguistic resources have been utilized in all our 
techniques for frame detection. Thus, preprocessing of the corpus consisted of the following 
tasks: basic morphological tagging; chunking [15]; detecting the headword of the noun 
and prepositional phrases, i.e. the word whose grammatical properties are 
inherited by the phrase; and filtering out constituents that do not contribute to the task, like 
abbreviations. 

Morphological tagging includes part-of-speech tagging for all words, case tagging 
for nouns, adjectives and pronouns, voice tagging for verbs, type tagging for verbs (dis- 
tinguishing between personal and impersonal verb types), type tagging for pronouns 
(distinguishing among three types of pronouns: relative, interrogative and other) and 
type tagging for conjunctions (distinguishing between coordinate and subordinate con- 
junctions). 

The chunker is based on minimal resources, i.e. a small keyword lexicon 
containing some 450 keywords (closed-class words) and a suffix lexicon of 300 of the most 
common word suffixes in MG. It detects the boundaries of intrasentential noun phrases, 
prepositional phrases, verb phrases, adverbial phrases and conjunctions. 

¹ Copyright 1999, ILSP. http://corpus.ilsp.gr 

In our approach, a verb occurring in the corpus in both the active and passive voice 
is treated as two distinct verbs. The same applies when the same verb appears with both 
a personal and an impersonal structure. 

2.1 Detecting Verb Environments 

We have carried out a number of experiments concerning the window size of the 
environment of a verb, i.e. the number of phrases preceding and following the verb. Windows 
of sizes (-2+3), i.e. two phrases preceding and three phrases following the verb, (-2+2) 
and (-1+2) were tested. For almost every environment, not the entire environment but 
a subset of it is a correct frame of the verb. Therefore all possible 
subsets [14] of the above environments were produced and their frequency in the corpus 
recorded as shown in Figure 1. Large subsets are less frequent and more likely to 
contain adjuncts, and are replaced by smaller, more frequent subsets. 

Fig. 1. Subsets and their counts of environments A (in plain font) and B (in bold). By calculating 
every permutation of environments A and B respectively, we obtain the frequency of occurrence 
of every subset (shown in parentheses). As a last step, frequencies of subsets of both environments 
are added. 
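
The subset generation and counting step can be sketched as follows. The phrase labels, the representation of an environment as a tuple of labels, and the exclusion of the empty subset are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations

def environment_subsets(environment):
    """Yield every non-empty subset of an environment, keeping phrase order."""
    for size in range(1, len(environment) + 1):
        for subset in combinations(environment, size):
            yield subset

def count_subsets(environments):
    """Add up, over all environments, how often each subset occurs."""
    counts = Counter()
    for env in environments:
        # set() avoids counting the same subset twice within one environment
        counts.update(set(environment_subsets(env)))
    return counts

# Two hypothetical environments of the same verb.
env_a = ("NP_acc", "PP")
env_b = ("NP_acc",)
counts = count_subsets([env_a, env_b])
```

In this toy case the subset ("NP_acc",) is produced by both environments and receives a count of 2, whereas the larger subset ("NP_acc", "PP") is seen only once.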



3 Bayesian Approaches 

Two Bayesian approaches were applied in order to study and compare their 
performance on the training data. Bayesian Belief Networks (BBN) and Naive Bayes (NB) 
are two fundamental methodologies that depend on different theoretical assumptions. 
Our objective was to determine which assumption achieves better results in the verb 
subcategorization framework. 

3.1 Bayesian Networks 

A Bayesian Belief Network (BBN) is a significant knowledge representation and 
reasoning tool under conditions of uncertainty. Given a set of variables 

$$D = \langle X_1, X_2, \ldots, X_n \rangle$$

a BBN describes the probability distribution over this set of variables. We use capital 
letters such as $X$, $Y$ to denote variables and lower-case letters such as $x$, $y$ to denote values taken 
by these variables. Formally, a BBN is an annotated directed acyclic graph (DAG) that 
encodes a joint probability distribution. We denote a network $B$ as a pair $B = \langle G, \Theta \rangle$ 
[13], where $G$ is a DAG whose nodes symbolize the variables of $D$, and $\Theta$ refers to 
the set of parameters that quantifies the network. $G$ embeds the following conditional 
independence assumption: 

Each variable $X_i$ is independent of its non-descendants given its parents. 

$\Theta$ includes information about the probability distribution of a value $x_i$ of a variable $X_i$, given the 
values of its immediate predecessors. The unique joint probability distribution over 
$D = \langle X_1, X_2, \ldots, X_n \rangle$ that a network $B$ describes can be computed using: 

$$P_B(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i)) \qquad (1)$$
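
As an illustration of equation (1), the following sketch multiplies the conditional probabilities of a toy two-node network. The variable names, the structure and the probability values are invented for the example and are not taken from the experiments.

```python
# Hypothetical network: adverb -> frame.
PARENTS = {"adverb": [], "frame": ["adverb"]}

# Conditional probability tables:
# variable -> {tuple of parent values -> {value -> probability}}
CPT = {
    "adverb": {(): {"yes": 0.3, "no": 0.7}},
    "frame": {
        ("yes",): {"valid": 0.4, "invalid": 0.6},
        ("no",):  {"valid": 0.8, "invalid": 0.2},
    },
}

def joint_probability(assignment, parents=PARENTS, cpt=CPT):
    """Product over all variables of P(value | values of its parents)."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpt[var][parent_values][value]
    return prob

p = joint_probability({"adverb": "no", "frame": "valid"})  # 0.7 * 0.8
```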



Learning BBN from Data. During the process of efficiently detecting the segment 
boundaries, prior knowledge about the impact each feature has on the classification of 
a candidate boundary as valid or not is not straightforward. Thus, a BBN should be 
learned from the training data provided. Learning a BBN unifies two processes: learning 
the graphical structure and learning the parameters $\Theta$ for that structure. In order to find 
the optimal parameters for a given corpus of complete data, the empirical conditional 
frequencies extracted from the data are used directly [4]. The selection of the variables 
that will constitute the data set is of great significance, since the number of possible 
networks that could describe these variables is given by (2) [10]. 

The following equations, along with Bayes’ theorem, are used to determine the relation 
$r$ of two candidate networks $B_1$ and $B_2$ respectively: 

$$r = \frac{P(B_1 \mid D)}{P(B_2 \mid D)} \qquad (3)$$

$$P(B \mid D) = \frac{P(D \mid B)\, P(B)}{P(D)} \qquad (4)$$

where $P(B \mid D)$ is the probability of a network $B$ given data $D$, $P(D \mid B)$ is the probability 
the network gives to data $D$, $P(D)$ is the ‘general’ probability of the data and $P(B)$ is the 
probability of the network before seeing the data. 

Before the data have been seen, no prior knowledge is obtainable and thus no 
straightforward method of computing $P(B_1)$ and $P(B_2)$ is feasible. A common way to deal 
with this is to assume that every network has the same prior probability as all the others. 
Substituting equation (4) into (3), the latter becomes: 

$$r = \frac{P(D \mid B_1)}{P(D \mid B_2)} \qquad (5)$$

The probability the model gives to the data can be computed using the following 
formula [8]: 

$$P(D \mid B) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(S / q_i)}{\Gamma(N_{ij} + S / q_i)} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + S / (r_i q_i))}{\Gamma(S / (r_i q_i))} \qquad (6)$$

where: 

• $\Gamma$ is the gamma function. 

• $n$ is the number of variables. 

• $r_i$ is the number of values of the $i$-th variable. 

• $q_i$ is the number of possible different value combinations the parent variables can 
take. 

• $N_{ij}$ is the number of rows in the data that have the $j$-th value combination for the parents 
of the $i$-th variable. 

• $N_{ijk}$ is the number of rows that have the $k$-th value for the $i$-th variable 
and which also have the $j$-th value combination for the parents of the $i$-th variable. 

• $S$ is the equivalent sample size, a parameter that determines how readily we change 
our beliefs about the quantitative nature of dependencies when we see the data. In 
our study, we follow a simple choice inspired by Jeffreys’ prior [9]: $S$ equals the 
average number of values the variables have, divided by 2. 
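
A minimal sketch of how such a score could be computed in log space is given below, assuming that formula (6) has the standard form of a Bayesian Dirichlet score with equivalent sample size S. The function name, the data layout and the toy data are assumptions; working with log-gamma values avoids overflow for realistic counts.

```python
from collections import Counter
from itertools import product
from math import lgamma

def log_score(data, parents, values, S):
    """Return log P(D | B) for complete discrete data.

    data:    list of dicts mapping variable -> observed value
    parents: dict mapping variable -> list of its parent variables
    values:  dict mapping variable -> list of its possible values
    S:       equivalent sample size
    """
    total = 0.0
    for var, pa in parents.items():
        r = len(values[var])                 # r_i
        q = 1                                # q_i
        for p in pa:
            q *= len(values[p])
        n_ij, n_ijk = Counter(), Counter()
        for row in data:
            j = tuple(row[p] for p in pa)
            n_ij[j] += 1
            n_ijk[(j, row[var])] += 1
        for j in product(*(values[p] for p in pa)):
            total += lgamma(S / q) - lgamma(n_ij[j] + S / q)
            for k in values[var]:
                total += lgamma(n_ijk[(j, k)] + S / (r * q)) - lgamma(S / (r * q))
    return total

# Toy data in which y always copies x.
values = {"x": [0, 1], "y": [0, 1]}
data = [{"x": 0, "y": 0}] * 4 + [{"x": 1, "y": 1}] * 4
with_edge = log_score(data, {"x": [], "y": ["x"]}, values, S=1.0)
without_edge = log_score(data, {"x": [], "y": []}, values, S=1.0)
```

The ratio r of equation (5) is then exp(with_edge - without_edge); on this toy data the network containing the edge from x to y receives the higher score.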



Given the great number of possible networks produced by the learning process, 
a search algorithm has to be applied. We follow a greedy search with one modification: 
instead of comparing all candidate networks, we investigate only the set that 
most resembles the current best model. In general, a BBN is capable of computing the 
probability distribution for any partial subset of variables, given the values or 
distributions of any subset of the remaining variables. Note that the values have to be discretised, 
and different discretisation sizes affect our network. Using a BBN, we are able to infer 
the boundary class given all or a subset of its parent variables. 



3.2 Naive Bayes 

In contrast to BBN, the Naive Bayes classifier relies on the hypothesis that the values of 
the variables $\langle X_1, X_2, \ldots, X_n \rangle$ are conditionally independent given a target value 
$v$. Given a set of training examples, when a new instance, described by the 
tuple of variable values $\langle x_1, x_2, \ldots, x_n \rangle$, is presented, NB classifies it by predicting its 
target value $v_{NB}$ using the following equation: 

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(x_i \mid v_j) \qquad (7)$$

where $v_j$ corresponds to a possible target value taken from a finite set $V$. The terms 
$P(v_j)$ and $P(x_i \mid v_j)$ are estimated by calculating their frequency over the training data. 
More particularly, for the segment detection task, $v_j$ takes two values, true for a correct 
boundary and false for an incorrect one. Thus, P(true) (or P(false)) is the number of correct (or 
incorrect) instances divided by the number of instances in the training data. 
Correspondingly, $P(x_i \mid v_j)$ is the number of times value $x_i$ occurred with target value $v_j$, 
divided by the number of instances with target value $v_j$. For that 
reason, training time is vastly less than that of BBN, where the parameters of a network 
have to be learned first, followed by a search over a large space of candidate networks 
for the best model. 
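
Equation (7) with plain relative-frequency estimates can be sketched as follows. The function names and the toy feature values are assumptions for illustration; no smoothing is applied, which matches the frequency-based estimates described above but means that an unseen value zeroes the score of a class.

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Count label frequencies and per-position value frequencies per label."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (position, label) -> Counter of values
    for x, v in zip(instances, labels):
        for i, xi in enumerate(x):
            value_counts[(i, v)][xi] += 1
    return label_counts, value_counts

def classify_nb(model, x):
    """Return the label maximising P(v) * prod_i P(x_i | v)."""
    label_counts, value_counts = model
    n = sum(label_counts.values())
    best, best_score = None, -1.0
    for v, count in label_counts.items():
        score = count / n                              # P(v_j)
        for i, xi in enumerate(x):
            score *= value_counts[(i, v)][xi] / count  # P(x_i | v_j)
        if score > best_score:
            best, best_score = v, score
    return best

# Hypothetical training instances: (first phrase, second phrase) -> valid frame?
instances = [("NP", "PP"), ("NP", "ADV"), ("PP", "ADV")]
labels = [True, True, False]
model = train_nb(instances, labels)
```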



4 Benchmark 

Two additional methods were used as a benchmark for the task of identifying SF, both 
of which were tested in previous research. 

Under the hypothesis that the distribution of verb environments in the corpus is 
independent of the distribution of the verb, we can use the Log-Likelihood Ratio (LLR) 
statistic to detect frames highly associated with verbs, by comparing the frequency 
of co-occurrence of a given environment with a certain verb to the frequency 
of its co-occurrence with the rest of the verbs and to its expected frequency in the input 
data. The candidate SFs whose LLR exceeds a specific threshold value tend to be valid 
SFs. In contrast, those with lower LLR tend to behave as adjuncts. 

The Relative Frequencies Threshold (RFT) is based on the relative frequencies of 
the candidate SFs. Upon completion of the extraction of SFs, they were ranked according to the 
probability of their occurrence with the verb. The probabilities were estimated 
using a maximum likelihood estimate from the observed relative frequencies. A 
threshold, determined empirically, was applied to these probability estimates to filter out the 
low-probability entries for each verb. 
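
Both benchmark filters can be sketched briefly. The text does not spell out the LLR formula, so the sketch assumes the usual log-likelihood ratio over a 2x2 contingency table of verb and environment counts; the function names, the counts and the threshold value are likewise illustrative assumptions.

```python
from math import log

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: verb with environment        k12: verb with other environments
    k21: other verbs with environment k22: other verbs, other environments
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    g2 = 0.0
    for observed, r, c in cells:
        if observed > 0:                  # a zero cell contributes nothing
            g2 += observed * log(observed * n / (r * c))
    return 2.0 * g2

def rft_filter(frame_counts, threshold):
    """Keep the frames of one verb whose relative frequency reaches the threshold."""
    total = sum(frame_counts.values())
    return {frame: count / total
            for frame, count in frame_counts.items()
            if count / total >= threshold}

kept = rft_filter({"NP": 8, "NP+PP": 2}, threshold=0.25)
```

When verb and environment are distributed independently the statistic is 0, and it grows as the observed co-occurrence counts depart from the expected ones.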



5 Experimental Results 

Since a valence dictionary for MG does not exist, it is theoretically impossible to 
determine with objectivity the entire set of frames that each verb can take. An objective recall 
value is therefore very hard to obtain. For this reason, our recall value was obtained 
by asking a linguist to provide the entire set of frames for 47 verbs which frequently 
occurred in the corpus. Denoting this frame set as $S_{47}$ and a verb belonging to this set as $V_i$, 
we tested our methods against this set. Recall and precision are given by the following 
relations: 

$$r = Correct(V_i) / Length(V_i) \qquad (8)$$

$$p = Correct(S) / Length(S) \qquad (9)$$

where $Correct(V_i)$ corresponds to the number of frames for verb $V_i$ that are correctly 
identified and $Length(V_i)$ is the total number of frames $V_i$ can take. $Correct(S)$ is the 
number of correctly classified instances found in test set $S$ and $Length(S)$ corresponds 
to the total number of instances. Both metrics were extracted by using 10-fold cross 
validation. 
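
Relations (8) and (9) can be written out as two small functions. The function names and the frame labels in the example are hypothetical.

```python
def recall(found_frames, gold_frames):
    """Correct(V_i) / Length(V_i): share of a verb's gold frames that were found."""
    gold = set(gold_frames)
    return len(set(found_frames) & gold) / len(gold)

def precision(predicted, actual):
    """Correct(S) / Length(S): share of test instances that were classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)

# A verb with four gold frames, two of which were identified.
r = recall(["NP", "NP+PP", "ADV"], ["NP", "NP+PP", "PP", "S"])
# Four test instances, two of which were classified correctly.
p = precision([True, False, True, True], [True, True, True, False])
```

Note that recall is computed per verb against the linguist's frame list, while precision is computed over classified test instances.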

The primary training set used to learn a BBN was constructed by manually tagging 
the correct frame class for 4700 instances from the ILSP corpus (after formulation of 
the subsets), with a window size of -2+3. 

In each experiment two types of input data have been tested: a complete training 
corpus and a training corpus from which all adverbs have been omitted. As adverbs in MG tend 
to behave far more like adjuncts than arguments, precision increased in the second case, as 
expected. BBN outperforms all other methods on the same feature set by an average 
of 4.7 % (taking 150 value intervals when discretising continuous variables). Concerning 
the conditional independence assumption made by Naive Bayes, 
BBN performs about 4% better. This difference provides a good reason why, in verb 
subcategorization detection, one should take the dependencies among the selected features into consideration. 
However, we emphasize that training time for Naive Bayes is about 100 times less than 
that of BBN. 

Moreover, maximum performance is obtained when selecting as features LLR, NoDE 
and the environment, on the -2+2 window size without taking adverbs into account. 
The results in Table 1 were obtained using this feature set. Again, BBN’s difference 
in performance exceeds 3.5% compared to the rest of the techniques. Furthermore, by 
taking adverbs into account, performance drops by approximately 5% on average. RFT 
gives some interesting results, showing that environments occurring in the data with the 
highest frequency are often correct frames. 



Table 1. Precision and recall values (in %, given as precision / recall) for the various window sizes and algorithms 

Without adverbs 

Window   BBN           LLR           NB            RFT 
-1+2     87.4 / 67.2   80.3 / 65     83.7 / 64.1   85   / 65.1 
-2+2     89.9 / 70.2   82.6 / 67     85.9 / 67.3   85.6 / 66.7 
-2+3     88   / 67.7   82.1 / 66.8   83.2 / 63.2   84.2 / 64.9 

Including adverbs 

Window   BBN           LLR           NB            RFT 
-1+2     81.8 / 65.2   79.3 / 63.6   78   / 63.2   81.1 / 61.9 
-2+2     86.3 / 69     77.8 / 62     83.7 / 63.8   82.8 / 65.9 
-2+3     84   / 68.5   80.1 / 64     80.4 / 63.1   81.5 / 62.3 



6 Conclusion 

In the present paper a set of well-known machine learning methods has been applied 
to the task of automatic acquisition of verb subcategorization frames. Using minimal 
linguistic resources, i.e. basic morphological tagging and robust phrase chunking, verb 
environments were identified and every environment subset was formulated. Bayesian 
Network learning and the Naive Bayes algorithm were applied to the task at hand, 
using a balanced corpus, and performance was compared to that of the 
Log-Likelihood Ratio statistic and the relative frequencies of the frames. Bayesian inference 
seems to outperform all other approaches by approximately 4.7% except for RFT. 
Adopting the assumption that features are conditionally independent lessens the 
performance but improves training time. Various window sizes were experimented with, 
and performance with a window size of -1+2 was slightly worse than with -2+2 
and -2+3. It has also been demonstrated that adverbs appearing in verb environments are a 
significant cause of noise, since when they are included in the candidate environments, 
precision is reduced. New frames not known beforehand were learned throughout the 
training process. 



Abstract. If a corpus is submitted to morphological analysis, there always 
remain some words that the analyser could not recognize (foreign names, 
misspellings, ...). However, if a human reads the texts, he usually understands them, 
even if he does not know as many words as there are in the lexicon used by the 
morphological analyser. The language itself helps him to recognize unknown words. 
It is not only semantics or syntax but also the pure morphology of unknown words 
that can contribute to their understanding. 

In this article, I describe a “guesser” that can lower the number of unrecognized 
words after the “classical” morphological analysis of Czech texts. It was tested 
on the Czech National Corpus. 



1 Introduction 

The automatic morphological analysis of Czech texts is carried out on the basis of a large 
lexicon (on the morphological analysis of Czech, see [2]). However, there are still 
quite a lot of words in the texts of the Czech National Corpus (CNC) that are not recognized 
by the morphological analyser. 

The relatively large number of unrecognized words in the Czech National Corpus 
(approximately 2.3 %) causes a lot of problems during the morphological and 
morphosyntactic disambiguation of Czech sentences. That is why I tried to find a method 
that would help to lower this rate. In fact, for the disambiguation (and for many other 
analyses) we do not need to recognize the words fully. It is sufficient to recognize the 
values of some of their morphological categories. Sometimes the part of speech alone will do. In 
other words, we need a guesser that would assign unknown words as many morphological 
values as possible. (On a guesser for French, see [3].) 

Czech has a very nice feature: the morphology of a word is “visible” in its ending. 
The ending can be of various lengths, but generally the last four letters of a word are, in the great 
majority of cases, sufficient for guessing the most important morphological values 
(see [1]). 

On the other hand, Czech has a very ugly feature: many word forms are 
ambiguous. One word form can belong not only to 2, 3, 4 or for some words even 5 parts of 
speech, but can also have many more combinations of case, number, gender, etc. That is why the 
morphological analysis assigns more than one tag to many words. 



2 The Guesser 

To construct the guesser I try to employ as many specific features of Czech words as 
possible. The most important one has already been mentioned: the last four letters of 
a word form are usually enough to recognize its basic morphological categories, namely 
part of speech, gender, number, case, verbal tense and person, and for specific parts of 
speech some others. This feature is common to several Slavonic languages. 

We need to have the complete list of all possible endings together with the tags that 
can be assigned to them. Generally, it is not easy to determine where the (morphological) 
ending of an unknown word begins. That is why I will call these endings segments in the 
following sections. The term segment is borrowed from another project, MOZAIKA 
(see [1]), which dealt with a similar problem. I have to stress that segments need not 
coincide with the morphological endings of words. If I need to refer to the end of a 
word as a part of the word, I will call it a tail to distinguish it from the word’s morphological 
ending. 

To summarize the terminology I am using: 

Ending is a morphological ending of a word form (e.g. ing in the word using, many 
English words have none , e.g. problem). 

Segment is a string of letters that can occur at the end of a word form (e.g. g, ng, 
ing, sing - they can occur at the end of many words) together with all the possible 
morphological tags assigned to them. 

Tail is an end string of a particular word (e.g. e, ge, age, uage for the word language). 

I take into consideration segments and tails of the length 2, 3 or 4 only. Segments of 
the length 1 - a single letter - usually have too many possible tags. 

For building the list of segments we would need a list of all Czech word forms, 
perfectly morphologically tagged. Such a list does not exist. However, the Czech National 
Corpus is big enough (100 million words) for us to obtain a very large list of segments, so 
I used it. 

I took all the recognized words of the corpus. For every word longer than 
4 letters I took its 4-letter tail together with the set of all the morphological tags that 
the analyser assigned to the word. This gave me a file of segments. I then grouped identical 
segments together, so that the resulting file contains for every segment all the possible tags that 
were found in the corpus. 
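
The collection step described above can be sketched as follows. This is only a minimal illustration, not the program actually used: the function name build_segments, the input format (pairs of a word form and its set of tags) and the tag labels are assumptions made for the example, and English words stand in for Czech ones.

```python
from collections import defaultdict


def build_segments(tagged_words, tail_len=4):
    """Map every tail of length tail_len to the union of tags seen with it."""
    segments = defaultdict(set)
    for word, tags in tagged_words:
        # Only words longer than the tail length contribute a segment.
        if len(word) > tail_len:
            segments[word[-tail_len:]].update(tags)
    return dict(segments)


# Toy input: (word form, tags assigned by the analyser); the tags are invented.
corpus = [
    ("language", {"N-sg"}),
    ("sausage", {"N-sg"}),
    ("using", {"V-ger"}),
    ("musing", {"V-ger", "N-sg"}),
    ("ring", {"N-sg"}),  # skipped: not longer than 4 letters
]
segments = build_segments(corpus)
```

Here the tails of "using" and "musing" coincide ("sing"), so their tag sets are merged into one segment.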

If the corpus is big and representative enough, all the combinations are covered. 
As we will never be sure that the list is complete (and it never will be), we should 
continually add all the new combinations that occur as further texts are gathered. Once we 
have a reasonably extensive file of segments, the contribution of every new text becomes 
smaller and smaller - see Figure 1. During creation of the list of segments from the Czech 
National Corpus I recorded the size of the list after the addition of every new text. The result 
is a curve depicting the speed of growth of the number of segments. The horizontal axis 
shows the number of words processed, the vertical one the total number of segments created 
on the basis of the words processed. I used a set of approximately 10,000 segments as the 
“nest-egg”. The number of segments visibly increases steeply at the beginning 
(this would be even more visible if I had used a smaller nest-egg). As the number of words 
processed rises, there are fewer new segments, so their number does not increase 
as rapidly as at the beginning. 

Fig. 1. Size of the list of segments depending on number of words processed 
(horizontal axis: number of words processed; vertical axis: additions of new segments) 

It is desirable to optimise the set of segments. This lowers the time needed for trying 
to match tails of unrecognized words with the segments. For some tag combinations it 
is sufficient to reduce the appropriate segment to 3, or even 2 letters. For instance, there 
are several tens of segments with the same set of tags that end with the two letters “uv”, 
and there is no other possibility of tagging the word forms having this tail. 

Reduction to 1 letter is not practical. There is no letter in the Czech language 
that would tell anything unambiguous about the grammatical categories of the word. 
We would have to include a large set (tens) of tags in the segment to be sure that we 
have covered all the possibilities. Listing so many alternatives, the guesser would be too 
weak. 

The optimization can be done automatically if we have the set of possible 
shorter segments. All the segments with the same ending and the same set of tags are 
removed and replaced by the shorter segment. Only those that had other possible 
tags remain, but without the common tags. After the optimization, the list of segments 
shrank considerably. 
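
One possible reading of this optimization step is sketched below. It is an assumption-laden illustration rather than the original procedure: the function name optimize is hypothetical, the "common tags" are taken to be the intersection of the tag sets of all longer segments sharing the same short ending, and the tags A and B are placeholders.

```python
def optimize(segments, short_len):
    """Move tags shared by all segments with the same short ending
    into a single shorter segment (illustrative reading of the text)."""
    # Group the longer segments by their last short_len letters.
    groups = {}
    for seg in segments:
        if len(seg) > short_len:
            groups.setdefault(seg[-short_len:], []).append(seg)

    result = {seg: set(tags) for seg, tags in segments.items()}
    for short, members in groups.items():
        if len(members) < 2:
            continue  # nothing to merge
        common = set.intersection(*(segments[m] for m in members))
        if not common:
            continue
        # The shorter segment takes over the common tags ...
        result.setdefault(short, set()).update(common)
        for m in members:
            rest = result[m] - common
            if rest:
                result[m] = rest  # ... longer ones keep only their extra tags
            else:
                del result[m]     # fully covered by the shorter segment
    return result


before = {"ruv": {"A"}, "tuv": {"A"}, "kuv": {"A", "B"}}
after = optimize(before, 2)
```

In this toy case three segments collapse into the shorter segment "uv" plus one remaining longer segment "kuv" that carries only its extra tag.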

When the file of segments was ready, I used it for assigning morphological tags to 
unknown words from the corpus. The algorithm is simple: 

1. Try to match all the segments to a tail of every unrecognized word. 

2. If the segment is the same as the word tail, assign to the word all the tags belonging 
to the segment. 

It is possible for a word tail to match more than one segment, because of different segment 
lengths. In such cases I assigned all the tags from all matching segments to the word. 
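
The two steps above, including the union over segments of different lengths, might look like this in code. The function name guess and the tag labels are hypothetical, and the segment file is assumed to be a plain dictionary from segment to tag set.

```python
def guess(word, segments, lengths=(2, 3, 4)):
    """Return the union of tags of all segments equal to a tail of the word."""
    tags = set()
    for n in lengths:
        if len(word) >= n:
            # Look up the tail of length n; unknown tails contribute nothing.
            tags |= segments.get(word[-n:], set())
    return tags


# Toy segment list with segments of two different lengths.
segments = {"uv": {"A"}, "kuv": {"B"}}
result = guess("ptakuv", segments)
```

The word "ptakuv" matches both the 2-letter and the 3-letter segment, so it receives the tags of both.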

For guessing morphological values we can employ other features of Czech words, 
too. For instance, a word cannot be in the negative form if it does not begin with the syllable 
“ne”. Other rules concern degrees of adverbs and adjectives. The degree cannot be 3 if the 
word does not begin with the syllable “nej”. The second and third degrees of adjectives 
must end with the letters “*i”, “*iho”, “*ich”, “*imi”, where the asterisk 
* stands either for “c” or “s”. These features are typical only for the Czech language. 
However, every language probably has its own specific features that could be used. 
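
The two prefix rules can be applied as a filter on the candidate tags proposed by the segments. The sketch below is only an illustration: the representation of a tag as a dictionary with the keys "negated" and "degree" is an assumption made here, not the tag format of the corpus, and the word forms are written without diacritics.

```python
def filter_by_prefix(word, tags):
    """Drop candidate tags that contradict the 'ne' / 'nej' prefix rules."""
    w = word.lower()
    kept = []
    for tag in tags:
        # A negative form must begin with the syllable "ne".
        if tag.get("negated") and not w.startswith("ne"):
            continue
        # The third degree must begin with the syllable "nej".
        if tag.get("degree") == 3 and not w.startswith("nej"):
            continue
        kept.append(tag)
    return kept


candidates = [
    {"pos": "A", "degree": 1, "negated": False},
    {"pos": "A", "degree": 3, "negated": False},
    {"pos": "A", "degree": 1, "negated": True},
]
plain = filter_by_prefix("velky", candidates)
superlative = filter_by_prefix("nejvetsi", candidates)
```

Note that the rules only exclude readings: a word beginning with "nej" keeps all its candidate tags, because the prefix is a necessary condition, not a sufficient one.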



3 Types of Unrecognized Words 

We can divide the unrecognized words into several categories: 

1. proper names 

2. misspellings 

3. special terms 

4. unusual abbreviations, signs, and similar rubbish 

3.1 Proper Names 

Proper names form the biggest part of the group of all unrecognized words. In the CNC 
there are 2,286,371 word forms with unrecognized morphology; 1,556,832 of them (68%) 
begin with a capital letter. Some of them may be unrecognized words of other categories 
that happen to occur at the beginning of a sentence, but almost all of them are proper 
names or parts of them. They are usually foreign names, because the great majority of 
common Czech names were recognized by the morphological analyser. 

Foreign proper names can have very different tails. Sometimes they have Czech 
endings added to a foreign root. In these cases the guesser described above perfectly 
identifies the part of speech and other morphological values. Nevertheless, some foreign 
proper names have the same tail as one of the Czech segments only by coincidence. It would 
be confusing to assign tags to them only according to our file of segments. 

To check the nature of unrecognized words beginning with a capital letter, 
I randomly selected 200 words and assigned a part of speech to them. They were mostly 
nouns. Some of them were words of a foreign language that were parts of names of a 
restaurant, a company, a film etc. Even if some of them were not nouns in their language, 
I could count them as nouns, because in fact they were parts of a designation (for instance 
the word “meets” in “Jazz Meets World”, or “orange” in “Orange Book”). 

There were only two words among the 200 that belonged to a part of speech other 
than noun. These two words were adjectives created by typical Czech endings from 
proper names (Chaninym, Volodarskem). On the basis of this small investigation, I 
made the following decision. Every proper name gets the morphological tag of a noun 
with all the possible values of gender, number and case. If it has a tail that matches 
some of the segments, all the tags belonging to those segments are assigned to the word, 
too. Altogether, a word that begins with a capital letter gets the morphological tags of nouns 
and also all the tags that conform to its tail, if any. Thus, I can be (almost) sure that all 
the possibilities are covered. 
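
A sketch of this capital-letter heuristic follows. The tag strings are invented for the example (a letter N followed by gender, number and case codes); the real tagset of the corpus is not reproduced here. The sketch assumes four genders (masculine animate, masculine inanimate, feminine, neuter), two numbers and the seven Czech cases.

```python
from itertools import product

# Hypothetical codes for the categories; only the counts matter here.
GENDERS = ("M", "I", "F", "N")
NUMBERS = ("S", "P")
CASES = ("1", "2", "3", "4", "5", "6", "7")


def all_noun_tags():
    """Every combination of gender, number and case for a noun."""
    return {"N" + g + n + c for g, n, c in product(GENDERS, NUMBERS, CASES)}


def guess_with_capitals(word, segments, lengths=(2, 3, 4)):
    """Noun tags for capitalised words plus all tags of matching segments."""
    tags = set()
    if word[:1].isupper():
        tags |= all_noun_tags()
    for n in lengths:
        if len(word) >= n:
            tags |= segments.get(word[-n:], set())
    return tags


segments = {"skem": {"ADJ"}}
name_tags = guess_with_capitals("Volodarskem", segments)
```

For "Volodarskem" the result contains all noun readings and, through the matching segment, the adjective reading as well.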

3.2 Misspellings 

Dealing with misspellings is easy if the error is not among the last 4 letters of the 
word. If it occurs there, it is probably not possible to assign the right morphological tag. 
A solution could be a spell-checker. I have not made any experiments in this field so far, 
but I am afraid that it could not be performed purely automatically. It would probably be 
necessary for a human to decide at least in those cases where the spell-checker offers 
more than one alternative. 

3.3 Special Terms 

Special terms are very often taken from other languages - mainly from English or Latin. 
Mostly, however, they are handled according to Czech grammatical rules. It means above 
all, that they get Czech endings. Therefore, they are easy to recognize for the guesser. 

3.4 Rubbish 

The fourth category of unrecognized words is difficult to deal with other than by 
adding new abbreviations, signs etc. to the lexicon used by the morphological analyser. 

4 Results 

The list of segments automatically drawn from the CNC contains 32,735 segments. 

I used it to guess unknown words from two small experimental corpora. The first 
one consists of 9 issues of the newspaper Mlada fronta Dnes, the second one is one issue 
of the popular scientific journal Vesmir. 

There were 3,935 unrecognized words in the newspaper corpus (2.7% of the corpus 
size, which is slightly more than the average of 2.3% in the whole corpus). In the journal, 
there were only 733 unrecognized words (1.9%). The number of unrecognized words differs 
quite a lot between the two corpora, probably due to a greater number of foreign names 
in the newspaper than in the specialized, though popular, journal. 

Using only the list of segments derived from the whole corpus, the guesser assigned 
morphological values to more than 60% of the unrecognized words of the newspaper; in 
the case of the journal it was even more than 70%. When I added the heuristic rule for 
capital letters, fewer than 15% of the unrecognized words remained in each corpus. 
The detailed results are shown in Table 1. 



Table 1. Results of the morphological guesser 

Corpus                 Mlada fronta Dnes    Vesmir 
Number of words        148,459              38,813 
Unrecognized           3,935 (100.0%)       733 (100.0%) 
Unrecognized capital   2,542 (64.6%)        397 (54.2%) 
Recognized (1)         2,477 (62.9%)        532 (72.6%) 
Recognized (2)         3,382 (85.9%)        639 (87.2%) 
Unrecognized (2)       553 (14.1%)          94 (12.8%) 

(1) ... on the basis of the tails only 
(2) ... after addition of the heuristic with the capital letters 
5 Correctness 

The first three categories were recognized well, which means that the resulting tags 
assigned to unrecognized words by the guesser contained the right tag. The only category 
that was not handled correctly is the fourth one - rubbish. Some abbreviations have the 
same tails as “normal” words (not abbreviations) and this cannot be solved by any guesser. 
There is only one way - enlarging the list of abbreviations, as was mentioned above. 
Abstract. The problem of definite description reference is solved in this paper as 
a clustering task. We present a clustering algorithm based on semantics to classify 
definite descriptions (anaphoric vs. non-anaphoric). Moreover, this clustering 
method helps to look for the correct antecedent using a resolution algorithm based 
on constraints and preferences. The preference system uses weight management 
to select the correct candidate belonging to the same cluster as the definite 
description. The EuroWordNet ontology provides the semantic class of the definite 
description, which is used to activate the correct cluster. Finally, we present some 
experiments related to this procedure. According to these experiments, a precision of 
88.8% and a recall of 83.9% were obtained for definite description treatment and 
resolution. 



1 Introduction 

Definite descriptions are the most usual grammatical expressions used to refer to a person, 
an object, an event, etc. All noun phrases used to describe the same specific concept, 
the same real-world entity, will be related or close in some way. Unlike pronouns, 
definite descriptions present two main problems: a) a long distance between the antecedent and 
the anaphoric expression, b) a definite description may or may not have referential properties. For 
this reason, the majority of algorithms for definite description coreference perform two 
tasks: classification and resolution. In this paper, we present a mechanism that combines 
a clustering technique to classify the definite description (DD) as anaphoric or not, 
and an algorithm to provide the correct antecedent by applying a set of constraints and 
preferences. These constraints and preferences use different kinds of information, such 
as morphological, syntactic and semantic. Preferences are applied using weight 
management instead of filter management to guarantee that the correct candidate is 
not rejected. 

Different approaches have been developed to resolve definite descriptions. On the one 
hand, most of them do not make a previous identification of the type of DD (anaphoric 
or non-anaphoric). They attempt to find an antecedent for every DD; if they do not find one, 
the DD is classified as non-anaphoric. Otherwise, the algorithm provides an antecedent. 
On the other hand, a few algorithms attempt a previous identification of DDs in order not to 
spend computation time on resolving a DD without a solution. Our approach identifies 
some non-anaphoric DDs beforehand and, after applying the resolution algorithm, the remaining 
non-anaphoric DDs are identified. 



2 The Algorithm 

The algorithm contains the following main components. 

1. Clustering module. In this module an algorithm to cluster DDs is applied to classify 
them as anaphoric or non-anaphoric. The clustering task uses the EuroWordNet 
ontology instead of a distance. 

2. Coreference module. This module uses a set of constraints and preferences to provide 
the correct antecedent. The following sets of rules are applied: 

- A set of semantic constraints that rule out anaphoric DD-NP dependence. 

- A set of preferences obtained from an empirical study is applied using a weight 
management system. For each preference a set of values is assigned to several 
salience parameters (frequency of mention, proximity, semantic relations) for a 
given NP. 

2.1 Clustering Task 

The algorithm goes through the text looking for noun phrases. Once a noun phrase is 
detected, its head noun is extracted. The head noun is used to obtain its base concept 
from the EuroWordNet ontology. The clustering technique uses this base concept instead of 
a distance to classify the noun phrases into equivalence classes. If there is at least one 
other noun phrase belonging to the same class, then the noun phrases are semantically compatible. 
Moreover, if the noun phrase found in the text is a DD, then it is classified provisionally 
as anaphoric and the resolution algorithm is applied. Otherwise, if there is no noun 
phrase belonging to the same semantic class, then the DD is classified as non-anaphoric 
and the algorithm goes on. 

Furthermore, because the noun phrases are clustered into semantically equivalent 
classes, the number of comparisons between antecedents and the DD is 
reduced, although the solution search space used by the resolution algorithm is made 
up of all previous sentences. This is because a DD is only compared with noun 
phrases belonging to the same class, since head nouns semantically related through 
synonymy, hypernymy or hyponymy relations are semantically equivalent. 

The mechanism is based on a simple idea: a DD will be non-anaphoric if it is not 
semantically compatible with any other noun phrase. But if there is one that belongs to the 
same semantic category, then the DD cannot be classified (as anaphoric or non-anaphoric) 
without applying the resolution algorithm. 
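
The clustering idea can be illustrated with a small sketch. It is not the implementation evaluated in this paper: the function name classify_dds is hypothetical, a plain dictionary stands in for the EuroWordNet base-concept lookup, the concept labels are invented, and each noun phrase is reduced to its head noun and a flag saying whether it is a definite description.

```python
from collections import defaultdict


def classify_dds(noun_phrases, base_concept):
    """noun_phrases: list of (head_noun, is_definite) in text order.
    Returns a provisional label for every DD and the resulting clusters."""
    clusters = defaultdict(list)   # base concept -> indices of NPs seen so far
    labels = []
    for index, (head, is_dd) in enumerate(noun_phrases):
        # Heads missing from the lexical resource go to an extra concept.
        concept = base_concept.get(head, "UNKNOWN")
        if is_dd:
            # A DD with no previous NP in its class is non-anaphoric;
            # otherwise it is only a candidate for the resolution algorithm.
            label = "candidate" if clusters[concept] else "non-anaphoric"
            labels.append((index, head, label))
        clusters[concept].append(index)
    return labels, dict(clusters)


concepts = {"car": "Vehicle", "vehicle": "Vehicle", "house": "Building"}
text_nps = [("car", False), ("vehicle", True), ("house", True)]
labels, clusters = classify_dds(text_nps, concepts)
```

Only the DDs labelled as candidates need to be passed on to the resolution algorithm, and each is compared only with the NPs stored in its own cluster.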

A Word Sense Disambiguation module is needed in order to provide the correct sense 
of head nouns. The Specification Marks module [5] is used in this work to provide the 
correct sense. After applying this module, a manual review was made to supervise 
the results. 

2.2 Semantic Constraints 

After generating the semantic network, a set of semantic constraints is applied to rule out 
anaphoric dependence of a DD on an NP due to incompatible semantic relations. 
We present two semantic constraints: 

R1 Two NPs that belong to the same cluster can only be coreferent if they have the same 
head noun or there is a synonym (car - auto) or hyperonym/hyponym (hyperonym: 
car - vehicle, hyponym: car - ambulance) relation between both head nouns. 

R2 A comparison between the modifiers of the DD and those of the semantically compatible NPs 
is made. If there is an antonym relationship (left - right) between two modifiers then 
the NP is rejected. 
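
The two constraints can be expressed as a simple filter. The sketch below is illustrative only: sets of word pairs stand in for the relations that the system reads from the ontology, and the function names and the dictionary layout of a noun phrase (keys "head" and "mods") are assumptions of the example.

```python
def related(a, b, pairs):
    """True if the pair occurs in the relation in either order."""
    return (a, b) in pairs or (b, a) in pairs


def passes_constraints(dd, np, synonyms, hypernyms, antonyms):
    """Apply R1 (head nouns) and R2 (modifiers) to a DD and a candidate NP."""
    h1, h2 = dd["head"], np["head"]
    # R1: same head noun, or a synonym or hyperonym/hyponym relation.
    if not (h1 == h2 or related(h1, h2, synonyms) or related(h1, h2, hypernyms)):
        return False
    # R2: reject the NP if any pair of modifiers is antonymous.
    for m1 in dd["mods"]:
        for m2 in np["mods"]:
            if related(m1, m2, antonyms):
                return False
    return True


synonyms = {("car", "auto")}
hypernyms = {("car", "vehicle"), ("ambulance", "car")}
antonyms = {("left", "right")}
dd = {"head": "car", "mods": ["left"]}
```

With these toy relations, "the right auto" is rejected by R2, "the vehicle" passes both constraints, and "the house" fails R1.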

2.3 Preference Management 

The system computes a salience value for each possible antecedent not rejected by the constraints. 
The antecedent with the highest salience value is chosen as the antecedent. For each candidate, 
preferences are applied by adding the weight of the fulfilled ones to the salience value of 
the candidate. 

A set of preferences obtained from an empirical study is applied. For each preference, 
a set of values is assigned to several salience parameters (frequency of mention, 
proximity, semantic relations) for an NP. 

P1 Repetition. The system selects the same DD as antecedent (same head noun and 
same modifiers). 

P2 Pre and post-modifiers relation. The system selects antecedents with the same head 
noun and with a semantic relation (synonym, hypernym, hyponym) between pre or 
post-modifiers of DD and antecedent. 

P3 Indirect anaphora (Bridging references¹). The system selects antecedents whose 
head nouns are related to the head noun of the DD through a synonym, hypernym or 
hyponym relation. Moreover, the system selects these antecedents with pre or post 
modifiers semantically related. 

P4 Antecedent without modifiers. The system first selects the NP with the same head 
noun and later with a semantic relationship between head nouns. 

P5 Gender and number agreement. The system selects the antecedent with gender and 
number agreement. 

P6 Closest. If more than one antecedent has the same highest salience value, then the 
system selects the closest antecedent. This rule guarantees that only one antecedent 
is proposed. 

¹ Definite descriptions with a different head noun from their antecedent were called bridging 
references by Clark [3]. 

The weights applied are as follows. If the possible antecedent has the same head noun as the DD, 
then a value of 50 is added to the salience value of the possible antecedent. But if there 
is a semantic relation (synonym, hypernym or hyponym) between the antecedent and DD 
head nouns, then a value of 40 (synonym relation) or a value of 35 (hypernym/hyponym 
relation) is added. Moreover, an amount based on the kind of relationship between the modifiers of 
both noun phrases (antecedent and DD) is also added to the salience value for each pair 
of modifiers. If a modifier appears with both head nouns, then a value of 10 is added to 
the salience of the antecedent. If there is a semantic relation between modifiers of both noun 
phrases, a value of 9 (synonym relation) or 8 (hypernym/hyponym relation) is added 
to the salience of the antecedent. If more than one candidate has the same maximum 
value, then the gender and number criteria are applied. Finally, if more than one candidate satisfies 
the last preference, the closest preference is applied. The values applied have been chosen 
experimentally. 
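
The weighting scheme can be made concrete with a short sketch. The weights (50, 40, 35 for head nouns; 10, 9, 8 for modifier pairs) are the ones given above; everything else is an assumption of the example: the function names, the dictionary layout of a noun phrase, the use of pair sets in place of the ontology, and the "pos" field used to decide which candidate is closest to the DD.

```python
HEAD_WEIGHTS = {"same": 50, "synonym": 40, "hyper": 35}
MOD_WEIGHTS = {"same": 10, "synonym": 9, "hyper": 8}


def relation(a, b, synonyms, hypernyms):
    """Kind of relation between two words, or None."""
    if a == b:
        return "same"
    if (a, b) in synonyms or (b, a) in synonyms:
        return "synonym"
    if (a, b) in hypernyms or (b, a) in hypernyms:
        return "hyper"
    return None


def salience(dd, cand, synonyms, hypernyms):
    """Head-noun weight plus one weight per related pair of modifiers."""
    score = HEAD_WEIGHTS.get(
        relation(dd["head"], cand["head"], synonyms, hypernyms), 0)
    for m1 in dd["mods"]:
        for m2 in cand["mods"]:
            score += MOD_WEIGHTS.get(relation(m1, m2, synonyms, hypernyms), 0)
    return score


def choose_antecedent(dd, candidates, synonyms, hypernyms):
    """Highest salience; ties broken by agreement, then by closeness."""
    if not candidates:
        return None
    scored = [(salience(dd, c, synonyms, hypernyms), c) for c in candidates]
    best = max(s for s, _ in scored)
    tied = [c for s, c in scored if s == best]
    agreeing = [c for c in tied
                if c["gender"] == dd["gender"] and c["number"] == dd["number"]]
    pool = agreeing or tied
    # Candidates precede the DD, so the largest position is the closest one.
    return max(pool, key=lambda c: c["pos"])


synonyms = {("car", "auto")}
hypernyms = {("car", "vehicle")}
dd = {"head": "car", "mods": ["red"], "gender": "m", "number": "sg"}
c1 = {"head": "auto", "mods": ["red"], "gender": "m", "number": "sg", "pos": 3}
c2 = {"head": "car", "mods": [], "gender": "m", "number": "sg", "pos": 7}
c3 = {"head": "vehicle", "mods": [], "gender": "m", "number": "sg", "pos": 9}
chosen = choose_antecedent(dd, [c1, c2, c3], synonyms, hypernyms)
```

In this toy case c1 (synonym head plus a shared modifier) and c2 (same head) both reach the maximum value, both agree in gender and number, and the closest one, c2, is proposed.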

The algorithm does not limit the number of previous sentences to be used in the 
search for the correct antecedent of a definite description, i.e., the algorithm stores all 
previous NPs. However, it only searches for the antecedent within the same ontological concept 
of the semantic network, reducing the number of comparisons. 



Table 1. Distribution of DD in the corpora (manual review) 

Corpus   Total   Non-anaphoric   Anaphoric DD 
                                 DA     IA    Total 
L        674     477             148    49    197 
D        352     210             120    22    142 
N        191     117             50     24    74 
Total    1217    804             318    95    413 


3 Experimental Work 

3.1 Corpora 

The data for the experiments was taken from three different corpora: the Lexesp 
corpus (L), a deed (notarial) corpus (D) and a newspaper corpus (N). The Lexesp² corpus is 
made up of a set of fragments in different styles (narratives, newspaper articles, 
etc.) written by different authors. The deed corpus is made up of several legal 
documents used by the information extraction system EXIT [4]. The newspaper corpus is 
made up of several articles extracted from digital newspapers. Table 1 shows the distribution 
of DDs in the corpora. We distinguish two different kinds of DD: anaphoric and 
non-anaphoric DDs. Anaphoric DDs are further divided into direct anaphora (DA) and 
indirect anaphora (IA). 

² LEXESP is a Spanish corpus of about 5 million tagged words developed 
by the Psychology Department of the University of Oviedo, the Computational Linguistics Group of the 
University of Barcelona and the Language Treatment Group of the Technical University of Catalonia. 

3.2 Evaluation Phase 

The algorithm was applied to the test corpora, obtaining the results shown in Table 2. 
The system recognized 759 of 804 non-anaphoric DDs. The identification of new entities 
(non-anaphoric DDs) is made in two stages. Before applying the resolution algorithm, 
the identification algorithm had already classified 304 of the 759 non-anaphoric DDs using 
only the ontological concept (a precision of 40%). Moreover, the system also classifies as 
non-anaphoric 406 of the remaining 455 (759-304) DDs after applying the resolution algorithm. Firstly, 
the system applies the automatic generation of the semantic network to identify the new 
discourse entities. Secondly, the constraints of the algorithm are applied only to antecedents 
that belong to the same ontological concept as the DD in order to provide the antecedent 
or to classify the DD as non-anaphoric. Table 2 shows the performance achieved by the 
system on the identification task: a recall of 88.3% and a precision of 93.5%. We can 
distinguish between two different sub-tasks within the anaphora resolution task: direct 
anaphora and indirect anaphora resolution. For direct anaphora resolution (same head 
noun) the algorithm achieves a recall of 79.9% and a precision of 84.4%; for indirect 
anaphora resolution, a recall of 60% and a precision of 63.3%. Overall, DD 
anaphora resolution achieves a recall of 75.3% and a precision of 79.3%. In general, the 
treatment of DDs achieves a recall of 83.9% and a precision of 88.8%. 
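
The recall and precision figures can be recomputed from the counts in Table 1 and Table 2. The sketch below assumes that the columns S and E of Table 2 stand for the numbers of successes and errors, so that recall is S divided by the number of DDs of that type in Table 1 and precision is S divided by S + E; the function name is hypothetical.

```python
def recall_precision(successes, errors, total):
    """Recall = S / total, precision = S / (S + E), both in percent."""
    recall = 100.0 * successes / total
    precision = 100.0 * successes / (successes + errors)
    return round(recall, 1), round(precision, 1)


# (S, E) from the totals of Table 2, number of DDs of each type from Table 1.
non_anaphoric = recall_precision(710, 49, 804)
direct = recall_precision(254, 47, 318)
indirect = recall_precision(57, 33, 95)
# Whole treatment: non-anaphoric DDs and anaphoric DDs taken together.
treatment = recall_precision(710 + 311, 49 + 80, 1217)
```

Under these assumptions the function reproduces the values quoted in the text for identification, direct anaphora, indirect anaphora and the treatment of DDs as a whole.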



Table 2. Anaphora resolution results after application to the test corpora 

         Non-anaphoric DD         Direct Anaphora          Indirect Anaphora 
C        S     E    R%     P%     S     E    R%     P%     S    E    R%     P% 
L        429   20   89.9   95.5   117   20   79     85.4   32   16   65.3   66.6 
D        182   18   86.6   91     95    20   79.2   82.6   13   7    59.1   65 
N        99    11   84.6   90     42    7    84     85.7   12   10   50     54.5 
Total    710   49   88.3   93.5   254   47   79.9   84.4   57   33   60     63.3 

         Overall 
         S     E    R%     P% 
Total    311   80   75.3   79.5 



After applying the algorithm to the test corpora, we detected several problems in 
classifying DDs as non-anaphoric. The problems are the following: 

- The head noun of the noun phrase is not found in the lexical resource (Spanish Wordnet). 
These head nouns are proper nouns, aliases, places and other words. These 
definite descriptions are classified as belonging to an extra ontological concept. This 
extra ontological concept was created to group DDs not found in the Spanish 
Wordnet. 

- Sometimes the pre and post-modifiers have different senses but the ontological 
resource does not provide any opposition relation (red car, blue car). 

After studying the corpus, we observed that most of the time DDs headed by a 
demonstrative refer to an antecedent. It is evident that a demonstrative is used to refer to 
something previously introduced. Only in a dialogue may it be used to introduce a new 
discourse entity (deixis phenomena). 
Table 3. Indirect comparative evaluation of some algorithms 

Task                  V&P v1        V&P v2        C&W           Our Algorithm 
                      R      P      R      P      R      P      R       P 
New Entity            74%    85%    74%    85%    -      -      88.3%   93.5% 
Anaphora resolution   62%    83%    52%    46%    53%    55%    75.3%   79.3% 
Overall               53%    76%    57%    70%    53%    55%    83.4%   88.3% 




3.3 Comparison with Other Approaches 

Several studies by different authors show that most DDs introduce new discourse entities 
(have no referential properties). We highlight the work of Vieira & Poesio [8] 
and Bean & Riloff [1] for English. The former demonstrates that about 
50% of DDs introduce a new entity. The latter shows that 63% of them introduce a new 
entity. These works use different corpora (33 articles from the WSJ and the Latin American 
newswire articles from MUC-4, respectively). Moreover, the work of Munoz 
et al. [6] for Spanish using the Lexesp corpus shows that 52% of DDs do not refer to 
any previous noun phrase. Obviously, the number of DDs with no referential properties 
depends on the corpus and the language, but the different studies show that about 
half or more of DDs introduce a new discourse entity. Consequently, algorithms without a prior 
classification (anaphoric or non-anaphoric) of DDs, which try to resolve the reference of all 
DDs, spend a lot of unnecessary computation time. On the other hand, in algorithms with prior 
classification of DDs, the DDs for which this classification fails are still passed on to the 
reference resolution task. 

We classify coreference algorithms into two categories: algorithms that make no 
attempt to identify non-anaphoric DDs prior to coreference resolution, and those that 
identify anaphoric and non-anaphoric DDs before coreference resolution. We highlight two 
algorithms belonging to the former group: Vieira and Poesio’s algorithm [8,7] is based 
on several tests in order to classify the DDs as non-anaphoric expressions or to provide 
the correct antecedent. This algorithm achieved a recall of 62% and a precision of 83% 
in resolving direct anaphora. Bridging descriptions were evaluated by hand, and 61 
relations of the 204 in the corpus were resolved. Moreover, the system achieved a recall of 
74% and a precision of 85% for identifying new discourse entities (non-anaphoric DDs). 
Bean and Riloff [1] proposed a corpus-based mechanism to identify non-anaphoric DDs. 
This algorithm uses statistical methods to generate lists of non-anaphoric DDs (called 
existential by Bean & Riloff) and DD patterns from a training corpus. This algorithm 
for identifying non-anaphoric NPs achieves a recall of 78% and a precision of 86%. 
Cardie & Wagstaff (C&W) [2] developed a clustering algorithm to resolve coreferences 
between noun phrases, achieving a recall of 53% and a precision of 54%. 

Comparative Results. Table 3 shows a comparison between four algorithms. The 
first two algorithms are two different versions developed by Vieira & Poesio (V&P 
v1 and V&P v2). The first version (v1) only treats and resolves direct anaphora and the 
identification of new discourse entities. The second version of the Vieira & Poesio algorithm 
(v2) treats the bridging references. The results shown for the second version in this table are 
only for bridging references because it applies the same heuristics for direct anaphora as 
v1. Kameyama’s algorithm is not a specific algorithm for DDs, but a general algorithm 
to treat pronouns and DDs. The results shown for the anaphora resolution task are only for 
definite NPs and the overall result is for pronouns and DDs. The results shown for our algorithm 
cover the resolution of direct and indirect anaphora and the classification of new discourse 
entities. 

Table 4 shows the results achieved by Vieira & Poesio and by our algorithm for each 
task. Our algorithm achieves better results than the Vieira & Poesio algorithms, but they cannot 
be directly compared because different languages, different corpora and a different type 
of coreference are treated. 



Table 4. Indirect comparison of the different tasks 

Task                 V&P           Our Alg. 
                     R%     P%     R%     P% 
New Entity           74     85     88.3   93.5 
Direct Anaphora      62     83     79.9   84.4 
Bridging reference   52     46     60     63.3 




4 Conclusion 

We have developed a semantics-driven algorithm for Spanish definite description 
resolution. The algorithm classifies DDs as anaphoric or non-anaphoric, and it resolves direct 
anaphora and indirect anaphora. On the one hand, the algorithm is based on the 
automatic generation of a semantic network from the base concepts of the head nouns. These 
head nouns are extracted from all NPs that appear in the text and their base concepts are 
extracted using the Spanish Wordnet ontology. The mechanism only uses semantic 
information to make a previous classification of DDs (anaphoric or not). The results obtained 
with this mechanism alone are not high: only 40% of non-anaphoric 
descriptions can be identified using ontological concepts exclusively. However, if we use other 
semantic information such as synonym, hypernym and hyponym relations (restrictions R1 & 
R2), the precision of the identification of non-anaphoric DDs increases to 93.5%. On 
the other hand, a set of semantic preferences is applied to candidates in order to provide 
the correct antecedent, achieving a precision of 79.3% and a recall of 75.3%. 
Abstract. In this paper, a method to resolve pronominal anaphora using full parsing 
is presented. This approach applies, as other traditional approaches do, a set of 
constraints and preferences to the list of candidates that have previously appeared in the text. 
In our approach, for Spanish, semantic criteria are also added to the resolution process. 
Furthermore, some co-position preferences have been defined based 
on the study of the Solution Searching Space. The preliminary results obtained 
from the evaluation of this method give a success rate of 79.3% 
correctly resolved pronouns. 



1 Introduction 

Anaphora resolution is, without doubt, one of the most relevant tasks in Natural Language 
Processing. There have been a large number of approaches to anaphora resolution 
and some of them have demonstrated a good success rate. Nevertheless, 
most of them have not been able to apply semantic knowledge in the anaphora 
resolution process. The main reason for this is that the use of semantic knowledge entails 
a significant cost in time and computational resources. 

Furthermore, in the work on anaphora resolution in Spanish, there is no reference 
to the use of full parsing in the preprocessing stage. In this paper, we present a method to 
resolve anaphora that takes as input the output of a full parser and applies a mechanism 
based on constraints and preferences using syntactic and semantic knowledge. 



2 Pronominal Anaphora 

Although anaphora can be classified into different types and treated accordingly, we are going to 
focus this study on pronominal anaphora and, in particular, on the most treated pronoun 
in the literature: the third-person personal pronoun. 



* This paper has been partially supported by the Spanish Government (CICYT) project number 
TIC2000-0664-C02. 
2.1 Motivation: Personal Pronouns 

All personal pronouns in Spanish, unlike in English, provide morphological information 
that is basic for the anaphora resolution process. Thus, Spanish personal pronouns can 
be singular and plural and masculine and feminine' . 

From the syntactic point of view, a personal pronoun can have different roles in the 
sentence. We have classified them into two groups: 

- subject pronouns: they have the subject role in the sentence, as it can be seen in 
example^ (1). Furthermore, it is very common to find this kind of pronoun (zero- 
pronouns) omitted in the sentence (represented by 0) -in the example (2) there is 
an ommited subject that correspond to the pronoun ’ellos’ (masculine ’they’), co- 
referent to ’los policfas’ (’the policemen’). For this reason, the method will need an 
algorithm to detect omitted pronouns. 

(1) Arzalluz_i calificó de ’intrascendente’ que las elecciones vascas_j se celebren
’unos meses antes o después’. Él_i dijo preferir, de hecho, que (Ø-ellas)_j se
celebrasen cuando se ’serenaran’ ciertas cuestiones.

Arzalluz_i described as ’unimportant’ that the Basque elections_j were celebrated ’some
months before or after’. He_i said to prefer, in fact, that they_j were celebrated when some
matters calmed down.

(2) Los policías_i tienen derecho a permanecer en esa zona de seguridad, según el
armisticio de 1999, sólo si (Ø-ellos)_i llevan armas ligeras.

The policemen_i have a right to stay in that security zone, according to the 1999 armistice,
only if they_i carry light weapons.

- complement pronouns: this group includes the rest of the personal pronouns,
that is, pronouns with a direct or indirect object role and those with a
circumstantial complement role, commonly included in a prepositional phrase. The next
examples show three pronouns acting as direct object (3), indirect object (4) and
circumstantial complement (5).

(3) Permitirían detectar y seleccionar potenciales talentos deportivos_i entre la
población infantil y orientarlos_i desde los primeros años a un deporte concreto.

They would allow to detect and select potential sport talents_i among the child
population and guide them_i from the first years to a concrete sport.

(4) El Gobierno_i sospecha que el PSOE quiere endosarle_i su hipotética decisión
de recurrir al Constitucional...

The Government_i suspects that the PSOE wants to lumber it_i with the hypothetical
decision to appeal to the Constitutional Court...

¹ From the semantic point of view, there is no animacy feature related to the pronouns in Spanish;
that is, unlike in English, a noun can be referred to using a masculine or feminine pronoun
depending on its gender and number. Nevertheless, the Spanish neuter pronoun is not used
to refer to inanimate objects but to refer to morphologically neuter elements such as facts, or to
textual elements such as sentences or full paragraphs. In any case, personal pronouns
referring to inanimate things are usually replaced by demonstrative pronouns.
² The examples we present in this paper have been extracted from the same corpus used for
the evaluation phase and belong to fragments of the electronic version of the newspaper El
País (http://www.elpais.es). In each example, a common way of coreference tagging
has been used: those elements (in this case, the pronoun and the noun phrase) that corefer are
identified by the same subscript.

(5) México no es sólo petróleo, y las cadenas de montaje_i establecidas en la frontera,
las maquiladoras, la mayoría de capital estadounidense, florecieron como
hongos a partir de 1994, y pese a las denuncias sobre las leoninas condiciones
laborales, 1.200.000 mexicanos encontraron trabajo en ellas_i.

Mexico is not only petroleum, and the assembly lines_i established on the border, the
’maquiladoras’, most of them with American capital, sprang up like mushrooms from 1994,
and despite the complaints about the harsh labour conditions, 1.200.000 Mexican people
found work in them_i.

Concerning the type of antecedents we are going to treat, the list of candidates in which
the method searches for the correct antecedent is going to be formed only by
noun phrases, as can be seen in the previous examples.



2.2 Background 

Theories and formalisms used in the anaphora resolution-oriented work have different 
natures. Our approach is based on the use of semantic information sources for identifying 
the antecedent of the pronoun. Thus, use of semantic information will be the main 
criterion in studying related works. 

Hobbs in [4] proposed one of the first approaches to anaphora resolution using only
syntactic restrictions for selecting the correct antecedent of a pronominal anaphora.
Lappin and Leass in [7] defined an algorithm based also on syntactic information, and
Kennedy and Boguraev in [6], on the basis of this work, used enriched restrictions
and morpho-syntactic preferences with information relative to the context of certain
pronouns. Mitkov and Stys in [8] proposed a pronominal anaphora resolution system
based on morpho-syntactic restrictions and a series of preferences (antecedent indicators)
that vary according to the text type. Baldwin in [1] presented CogNIAC, a pronominal
anaphora resolution system. He also used morpho-syntactic restrictions (gender and
number agreement and c-command) and a series of heuristic preferences. These
approaches do not make use of semantic information sources as an additional resource
for pronominal anaphora resolution. For this reason, they are called knowledge-poor
systems.

On the other hand, Ferrandez et al. in [2] included a pronominal and adjective
anaphora resolution module working from a partial parse in an NLP system for Spanish. It
uses a grammatical formalism based on restrictions and morpho-syntactic preferences.
Moreover, this method proposes the addition of semantic information for restricted
domain texts with IRSAS, a system developed by Moreno et al. in [9]. In this line,
another approach that uses semantic information to help the anaphora resolution task is
shown by Saiz-Noeda et al. in [11]. Moreover, the LaSIE system presented by Gaizauskas et
al. in [3] and by Humphreys et al. in [5] was designed as a general purpose information
extraction (IE) system where a discourse model based on a predefined domain model is
built, using the semantic analysis supplied by the parser. The domain model represents
a hierarchy of domain-relevant concept nodes, together with associated properties.



3 The Full Parsing and the Method

The method we propose is based on the classic anaphora resolution methods that build a 
list of possible antecedents of a pronoun. They apply, on the one hand, a set of constraints 
to erase ‘false’ candidates from the list and, on the other hand, a set of preferences that 
try to decide the best candidate to be the correct antecedent. 

In this work, we try to introduce new selection criteria to improve on the results of
traditional approaches in the antecedent choice. These criteria are based on the full parsing
that is used as the input for the method and on the use of semantic information that
complements the rest of the linguistic sources.

The main steps of the method are the detection of the anaphoric pronoun, the construction
of the list of candidates formed by the preceding noun phrases, the application of
linguistic constraints in order to eliminate incompatible candidates and, finally, the selection
of the correct antecedent using the set of preference criteria.



3.1 Full Parsing 

The anaphora resolution method has been applied to the output of a full parser that
provides morphological and syntactic information about each word in the text. This
analyzer is Conexor’s FDG Parser (see [12]), which tries to build a dependency
tree for the sentence. When this is not possible, the parser tries to build partial trees
that often result from unresolved ambiguity. The analyzer assigns to each word a text
token, a base form and functional link names, lexico-syntactic function labels and parts
of speech.



3.2 Identification of the Anaphoric Pronoun 

From the full parsing output, the method detects the pronoun using its part-of-speech
label (PRON). Furthermore, the information provided by the parser is wider and sometimes
reveals further useful information about the pronoun, such as its role in the
sentence (subject, direct object, indirect object) and its gender and number.



3.3 Solution Searching Space: Building the List 

A set of noun phrases is selected and stored in a list of candidates. The selection of these
candidates is based on the Solution Searching Space (SSS) definition, which establishes the
number of sentences that are going to contain the candidates.

The SSS has been obtained from an empirical study of the distribution of the solutions
for the personal pronouns in the corpus with regard to their position in the text. Table 1
shows that most of the solutions are found in the same sentence (14% + 48%), fewer in
previous sentences and a few of them in the previous paragraph. For this reason, together
with the fact that a new paragraph introduces a context change (more or less severe), the
method must give priority to the candidates closest to the anaphor and little relevance to
the candidates in previous paragraphs.

Table 1. Antecedent distribution in the text

Antecedent   Position
14%          same clause
48%          diff. clause, same sentence
34%          diff. sentence, same paragraph
3%           previous paragraph
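
The position categories of Table 1 can be made concrete with a short sketch. The Python fragment below is only an illustration under stated assumptions, not the implementation used in this work: the `Position` record, its fields and both function names are hypothetical, and the percentages are simply copied from Table 1.

```python
from dataclasses import dataclass

# Percentages copied from Table 1.
TABLE_1 = {
    "same clause": 14,
    "diff. clause, same sentence": 48,
    "diff. sentence, same paragraph": 34,
    "previous paragraph": 3,
}


@dataclass(frozen=True)
class Position:
    """Hypothetical location record for a noun phrase or a pronoun."""
    paragraph: int  # paragraph index in the text
    sentence: int   # sentence index within the paragraph
    clause: int     # clause index within the sentence
    token: int      # token offset in the text, used only for ordering


def relative_position(candidate, anaphor):
    """Return the Table 1 category of a candidate, or None if it is excluded."""
    if candidate.token >= anaphor.token:
        return None  # only preceding noun phrases are candidates
    if candidate.paragraph == anaphor.paragraph:
        if candidate.sentence == anaphor.sentence:
            if candidate.clause == anaphor.clause:
                return "same clause"
            return "diff. clause, same sentence"
        return "diff. sentence, same paragraph"
    if candidate.paragraph == anaphor.paragraph - 1:
        return "previous paragraph"
    return None  # further back than the search space considered here


def coverage(categories):
    """Share of antecedents (in percent) covered by the given categories."""
    return sum(TABLE_1[c] for c in categories)
```

Under these assumptions, a search space limited to the pronoun's own paragraph covers 14 + 48 + 34 = 96% of the antecedents reported in Table 1.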



3.4 Constraints: Rejecting Useless Candidates 

Once the anaphor has been detected and the list of its candidate antecedents has been
built, it is convenient to reject, first of all, those noun phrases that we know for certain
cannot corefer with the pronoun. For this purpose, a set of constraints has been defined.
These constraints are based on morphological and syntactic information provided by
both anaphor and antecedent, although some semantic criteria have been applied, as
will be explained next.

- Morphological constraints: these are the easiest constraints to apply because we can
(in general) rely on the gender and number information of the pronoun and the noun
phrase. For this reason, the simplest way of applying this constraint is to compare
both elements from the morphological point of view:

C1 The antecedent and the pronoun must agree in gender and number

Nevertheless, coreference without morphological agreement is possible,
as in the case of collective nouns; see (6). In these cases, it is necessary
to consider some additional morpho-semantic features that define agreement
conditions different from the strictly morphological ones.

(6) ... la gente_i no sabía qué hacer, si dormir, puesto que (Ø-ellos)_i se morían de
sueño, si comer, [...]

... people_i didn’t know what to do, whether to sleep, because (they)_i could hardly keep
their eyes open, or to eat, [...]

In this example, the number of ‘ellos’ (masculine they) is plural, while the number 
of its coreferent ‘gente’ (people) is singular. If we want to avoid the rejection of 
the candidate ‘gente’ using the classical morphological constraint, we could add the 
collective noun feature ‘group’ to the noun ‘gente’. 

- Syntactic constraints: we have chosen, in this method, three constraints based on 
the syntactic role of the candidate and the anaphor: 

C2 The subject and the direct object of a verb cannot corefer 
C3 The subject and the indirect object of a verb cannot corefer 
C4 The direct object and the indirect object of a verb cannot corefer 
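
The constraints C1 to C4 can be read as a filter over the candidate list. The following Python sketch is a minimal illustration under stated assumptions rather than the method's actual implementation: the `Mention` record, its field values and the function names are hypothetical, and the way the collective feature relaxes agreement (accepting any plural pronoun) is an assumption made here for the sake of the example.

```python
from dataclasses import dataclass

ARGUMENT_ROLES = {"subj", "dobj", "iobj"}


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for a pronoun or a candidate noun phrase."""
    text: str
    gender: str               # "masc" or "fem"
    number: str               # "sg" or "pl"
    role: str                 # "subj", "dobj", "iobj" or "other"
    verb: int                 # identifier of the governing verb
    collective: bool = False  # assumed 'group' feature for collective nouns


def agrees(candidate, pronoun):
    """C1, relaxed for collective nouns (assumption: they accept plural pronouns)."""
    if candidate.collective and pronoun.number == "pl":
        return True
    return (candidate.gender == pronoun.gender
            and candidate.number == pronoun.number)


def violates_syntax(candidate, pronoun):
    """C2-C4: two different argument roles of the same verb cannot corefer."""
    return (candidate.verb == pronoun.verb
            and candidate.role != pronoun.role
            and candidate.role in ARGUMENT_ROLES
            and pronoun.role in ARGUMENT_ROLES)


def filter_candidates(candidates, pronoun):
    """Keep only the candidates that survive every constraint."""
    return [c for c in candidates
            if agrees(c, pronoun) and not violates_syntax(c, pronoun)]
```

With the collective feature set, a singular feminine candidate such as 'la gente' survives the filter for the plural pronoun 'ellos', as in example (6); without it, C1 rejects the candidate.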

3.5 Preferences: The Art of Choosing 

There are two main approaches to the way the preferences are applied in the anaphora
resolution process. On the one hand, the preferences can be ordered according to their
relevance and applied in a similar (less restrictive) way to the constraints, that is, rejecting
candidates only if none of the preferences are satisfied. On the other hand, each preference
can assign a weight to the candidate, the correct antecedent being the candidate with the
highest total weight in the list. At the end of the process, depending on the method, the
“survivor” or the “winner” will be chosen.

Our method will apply the second approach. Each preference will give a value to
each candidate in accordance with the relevance of its fulfillment. The candidate will
add the value assigned by each preference to the value already stored. At the end of
the preference stage, the candidate with the highest value will be chosen as the antecedent of
the anaphor. With this technique, the order of application of the
preferences does not matter. The weight of each preference has been decided from an
empirical study of the corpus.

We propose a set of preferences of three different kinds. On the one
hand, a syntactic preference relates the syntactic roles of both anaphor and antecedent
(P1). On the other hand, there is a group of ’co-position preferences’ based on the
position of the candidate with regard to the anaphor (P2.1, P2.2 and P2.3), resulting
from the previously mentioned study of the corpus (see Table 1). Furthermore, the method
includes a semantic preference that uses the concept of compatibility between a subject and
its verb and between a verb and its object (P3). This compatibility concept defines the
degree of affinity between the semantic concepts associated with a verb and its subject or
object³.

So, the set of preferences is the following: 

P1 The candidate and the pronoun have the same syntactic role
P2.1 The candidate is in the same clause as the pronoun
P2.2 The candidate is in a previous clause but in the same sentence as the pronoun
P2.3 The candidate is in a previous sentence but in the same paragraph as the pronoun
P2.4 The candidate is in the paragraph previous to that of the pronoun
P3 The candidate is semantically compatible with the verb of the pronoun

Finally, if more than one candidate remains with the highest value after these preferences
have been applied (a draw), one last preference is applied:

P4 The candidate nearest to the anaphor is considered the correct antecedent

4 Preliminary Evaluation 

Although the algorithm has not been completely implemented, we have carried out a
manual evaluation that reveals the main weak and strong points of this method.
As the evaluation corpus we have used a set of texts containing news on the common
topics of a newspaper.

³ A method that shows the use of these noun-verb patterns, extracted from two ontologies and
oriented to anaphora resolution, is detailed in [10].







In this experiment, we defined an SSS of one paragraph; that is, we did not include in
the list of candidates those noun phrases located in the previous paragraph (so preference
P2.4 has been eliminated). In a fragment of the corpus containing 58 anaphoric pronouns
(including omitted pronouns), the average number of candidates per anaphor was
approximately 12.

Once the candidates passed through the constraints’ filter, the preferences were ap- 
plied in the way already explained, weighting each candidate as shown in Table 2 for 
each preference. 



Table 2. Weights for preferences in the evaluation

preference   P1    P2.1   P2.2   P2.3   P3
weight       +30   +30    +20    +10    +30
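
The weighting scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype described in this paper: the `Candidate` record and the function names are hypothetical, the weights are those of Table 2, and the truth values for P1 and P3 are assumed to come from earlier processing steps that are not shown.

```python
from dataclasses import dataclass

# Weights taken from Table 2 (P2.4 is not used in the evaluation).
WEIGHTS = {"P1": 30, "P2.1": 30, "P2.2": 20, "P2.3": 10, "P3": 30}

# Maps the relative position of a candidate to its co-position preference.
POSITION_PREFERENCE = {"clause": "P2.1", "sentence": "P2.2", "paragraph": "P2.3"}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for a candidate that passed the constraints."""
    text: str
    same_role: bool   # P1: same syntactic role as the pronoun
    position: str     # "clause", "sentence" or "paragraph"
    compatible: bool  # P3: semantically compatible with the pronoun's verb
    distance: int     # distance to the anaphor, used only for P4


def score(candidate):
    """Add up the weights of all preferences the candidate satisfies."""
    total = WEIGHTS[POSITION_PREFERENCE[candidate.position]]
    if candidate.same_role:
        total += WEIGHTS["P1"]
    if candidate.compatible:
        total += WEIGHTS["P3"]
    return total


def choose_antecedent(candidates):
    """Highest score wins; a draw is resolved by P4 (nearest candidate)."""
    return min(candidates, key=lambda c: (-score(c), c.distance))
```

Because the scores are simply summed, the order in which the preferences are evaluated has no effect on the result. Note that `choose_antecedent` raises `ValueError` on an empty list, so a caller would need to handle the case in which the constraints reject every candidate.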



Although this initial approach is somewhat naive, the results obtained in the preliminary
evaluation reveal a success rate of 79.3% of anaphors correctly resolved. Apart from the
incorrect resolutions, the main unavoidable problems detected in the anaphora resolution
process come from errors or incomplete parses in the analysis and from incorrect decisions
induced by previous incorrect resolutions. For this reason, it is necessary to define a good
coreference chain system to store all the noun phrases (including antecedents) that belong
to the same conceptual entity.

5 Conclusions and Outstanding Work 

In this paper we have presented a method for resolving pronominal anaphora using full
parsing and adding semantic information to the set of preferences. This method has been
defined using a set of constraints and preferences based on linguistic information. The
list of candidates has been built using a Solution Searching Space to look for the possible
antecedent. The preliminary results obtained from the manual evaluation of this method
give a success rate of 79.3%.

The next step in this work is the construction of a prototype that automatically takes the
parser’s output and applies the set of constraints and preferences in order to resolve the
anaphora. Due to the errors detected in the parsing, it will be necessary to implement an
intermediate pre-processing module to supply badly parsed and unparsed information to
the anaphora resolution module. This prototype will be inserted in a workbench designed
to compare it with other approaches to anaphora resolution.



Abstract. This paper advocates the claim that the robustness of an
automatic natural language parser is something different from a simple
ability to construct a syntactic structure for each sequence of word forms (sentence)
of a given language.

The robustness in our terminology should be more accurate in the sense that the parser should
be able to distinguish between “good” and “bad” ill-formed sentences. We propose
to use two measures for this purpose, the node-gap complexity which describes 
the complexity of the sentence with regard to nonprojective constructions, and 
the degree of robustness which takes into account the number of syntactic incon- 
sistencies encountered in the process of robust parsing. These measures make it 
possible to develop a scale of global constraints which allow a kind of gradual 
parsing of both syntactically well-formed and ill-formed sentences of a natural 
language. 



1 Introduction 

Practical experience in the field of natural language parsing shows that every project
aiming at the most complete possible coverage of the natural language under
consideration has to cope with imperfections of the input. Even when the input
texts have been thoroughly checked (and grammatically corrected) by human reviewers (for
example in journals or newspapers), it is often possible to encounter a grammatical error.
The investigations carried out in the past for Czech have shown that printed texts contain
numerous errors of a wide range of error types. It is therefore quite clear that if we aim
at parsing any input sentence from a real text in the given language, we have to build a
robust parser.

We would like to introduce a method that makes it possible to formulate a set of constraints
which control the application of grammar rules so that the same grammar may be interpreted
in several ways. This approach was inspired by the idea of modelling the parsing of a natural
language sentence as performed by a human reader when he or she tries to reconstruct the
syntactic structure of an ill-formed sentence in order to be able to capture its structure. This
simple idea has led us to the formulation of a theory which provides a solid basis for
the investigation of theoretical and practical issues related to the problem of complete
syntactic parsing of Czech as a language with a relatively high degree of word-order
freedom.

At this point it is also necessary to define a very important set of terms used in the
subsequent text. It is often the case that different people in different circumstances
understand the terms natural language analysis and natural language parsing in various
ways.

In order to overcome this terminological uncertainty we have decided to stick to the 
following terminology: 

We are going to use the term natural language analysis for a complex process involv- 
ing not only syntactic analysis, but also semantics, pragmatics and real world knowledge. 
The result of natural language analysis is typically one (best, most acceptable) sentence 
structure. A typical example of natural language analysis is the analytical level of
the Prague Dependency Treebank [1], the result of human-made analysis taking into
account all possible sources of knowledge and information (including the very broad context
in which a particular sentence appeared).

The term (syntactic) parsing is going to represent the process of analysis according to
a grammar (or metagrammar) containing (syntactic) rules and constraints. If the rules and 
constraints capture the rules and constraints of the surface syntax, we talk about a surface 
(syntactic) parsing. The result of (surface) parsing is a set of structures describing all 
possible syntactically acceptable variants of the analysis of a particular input sentence. 
None of the resulting structures is considered to take precedence or to be “better” or 
“worse” than the rest. 



2 Standard Approach to Robustness 

As we have already mentioned above, one of the main topics of this paper is the claim that
the robustness of an automatic natural language parser is something
different from a simple ability to construct a more or less reliable syntactic structure for
each sequence of word forms (sentence) of a given language. The robustness in our
terminology should be more accurate in the sense that the parser should be able to distinguish
between “good” and “bad” ill-formed sentences.

Robustness is generally understood as the ability of a particular system to cope
with all kinds of input, regardless of whether it is ill-formed or well-formed. A typical example
of a robust parser in this sense is in fact any stochastic parser. Stochastic analysis does
not provide any means for distinguishing ill-formed sentences from well-formed ones,
dealing with both types in a uniform way and assigning just one, the most probable,
structure to each input sentence. In this way it is even possible to obtain a structure for
a sequence of word forms which is so ill-formed that even native speakers are
able neither to assign a structure to this sequence nor to understand what the sequence
of words actually means.

It is quite clear that assigning a structure to the following sequences of Czech word
forms is pointless:

Pokud, a náhodné sice celkový, však který, pouze této Ke,

[If, and accidental though total, however which, only this To, ]

(This sequence contains every sixth word form from a short Czech article)

I had to let my house to a nosy patron.

[And snake it flight we gosling it and noses patron.]

(This sequence of Czech word forms is a sentence in English, not in Czech.)

The stochastic parsers typically provide the most probable syntactic structure not 
only for the ill-formed sentences of the first type, but also for the second category. This 
claim is supported by the result of parsing our second sample sentence by means of the 
best existing stochastic parser adapted for Czech, the parser of M. Collins (cf. [2]). We 
have obtained the following structure: 

AuxS(a(I, had, to, let, house(my, to), nosy), patron, .)



3 What Is an Accurate Robustness 

The broad understanding of robustness may be useful for practical applications, but
it has nothing to do with the aim of a more adequate description of a particular natural
language. The robustness we are aiming at is different.

The first goal we aim at is the ability of the robust parser to guarantee syntactic
acceptability. While the syntactic parser draws a strict line between sentences belonging
to the set of syntactically well-formed sentences of a given natural language and those
which do not, the robust parser should be able to draw a similar line between ill-formed
sentences which may be corrected according to the syntax of a particular language
and those which are definitely syntactically unacceptable. The second goal is more
general: we aim at creating a theoretical framework that allows this borderline to be shifted
in a consistent and adequate manner by applying a set of general constraints
expressed through the measures defined in this paper (and in the papers about our approach
published in previous years). In a general sense, this goal means that our theory
allows the creation of a scale of parsers with different degrees of robustness or word-order
freedom.

One of the main topics of this paper is the endeavor to describe at least some
constraints that make it possible to achieve these goals. We do not claim that the constraints
introduced here are the only constraints suitable for the declared purpose; it is quite clear
that several other types of constraints may be formulated in the future. These constraints
will then allow the creation of an even more refined scale of parsers and thus they will also
support a more adequate description of the syntactic properties of natural languages with a
high degree of word-order freedom.

At this point it is also necessary to specify the type of constructions which are going 
to be considered as syntactically ill-formed in the following text. The position adopted 
in this work reflects the fact that very often a sentence is rejected by a human reader for 
some extrasyntactic reasons. For example it may be unintelligible; it may be stylistically 
unacceptable; its preferred reading may violate grammatical rules while there is a second, 
syntactically well-formed reading, with a meaning unacceptable in the given world etc. 

This supports the claim that it is really necessary to draw a clear borderline between
natural language analysis and (surface) syntactic parsing. Rather than attempting to solve
a wide spectrum of very complicated problems of natural language analysis, we would
like to concentrate on the question of what can be done if only “pure” syntax is taken into
consideration, thus aiming at the problem of syntactic parsing. The following pages show
that even with such a drastic restriction of scope, the problem of robust parsing of a
language with a high degree of word-order freedom is very complicated and complex. Thus,
throughout the following text, wherever we refer to ill-formed constructions
or sentences, we will have in mind only those constructions or sentences which are
ill-formed from the point of view of syntax alone; no other factors are going to be taken
into account. This position is similar, for example, to the position taken in the classical
literature in [8].

4 Basic Notions 

In this section we would like to describe the basic notions used throughout the whole text. 
First of all, let us introduce the basic data structure (D-trees, a special kind of dependency 
trees) we are going to use both for visualization of results of the parsing process and 
also for theoretical purposes. This data structure is a traditional means for visualization 
of the syntactic structure of a sentence in the European continental linguistic tradition. 
A number of different formalizations exist, e.g. [8]. Our definition differs technically
and contains more information (the guidelines on how to draw D-trees) than the traditional
ones and in this manner better reflects the needs of subsequent chapters. It was defined,
for example, in [4].

In the following text we are also going to talk very often about one very important
property of dependency trees, namely nonprojectivity. Let us build this term gradually,
through the introduction of a projection of a node of the D-tree (in some of our papers
we have used the term coverage), which then allows us to formulate the definition of the
term projectivity/nonprojectivity in a manner suitable for further definitions of measures
of nonprojectivity.

Definition: Let T be a D-tree and let u be a node of T. The set of horizontal indices
(i.e. horizontal positions expressed as natural numbers) of all nodes v of the tree T such
that there exists an (oriented) path from v to u will be called the projection of u within
T and denoted Cov(u, T). When defining Cov(u, T), we also take the empty path into
consideration, hence Cov(u, T) always contains the horizontal index of u. Let T
be a D-tree over the sentence $w = a_1 \ldots a_n$, let u be a node of the tree T and let

$$Cov(u, T) = \{i_1, i_2, \ldots, i_k\}, \quad i_1 < i_2 < \cdots < i_{k-1} < i_k.$$

For $1 \le j < k$ we say that the pair $(i_j, i_{j+1})$ creates a hole in Cov(u, T) iff $i_{j+1} - i_j > 1$.

The last step we need to make before we are able to define the terms projectivity
and nonprojectivity is the definition of the following measure: Let T be a D-tree over the
sentence w and let u be a node of the tree T. We define the measure dNh(u, T) (number
of holes in a dependency subtree rooted in u) as the number of holes in Cov(u, T). It
is now easy to formalize the traditional terms of projectivity and nonprojectivity of a
dependency tree (for previous usage of these terms cf. [8]) using the notions defined in
previous parts of this definition. Let T be a D-tree over the sentence w. The number
dNh(T), equal to the maximum of {dNh(u, T); u ∈ T}, will be called the magnitude
of non-projectivity of the sentence w with the structure T. If dNh(T) = 0 holds, we say
that the sentence w with the structure T is projective, otherwise we say that the sentence
w with the structure T is nonprojective.

According to our definitions, the sentence

’Nepodařilo se otevřít zadaný soubor.’

[Lit.: It_was_not_possible Refl. to_open specified file.]

(It was not possible to open the specified file.)

is projective and the sentence

’Zadaný soubor se nepodařilo otevřít.’

[Lit.: Specified file Refl. it_was_not_possible to_open.]

(The specified file was the one it was not possible to open.)

is nonprojective.
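
The measures defined above are easy to compute once a D-tree is given. The following Python sketch is an illustration under stated assumptions: a tree is represented as a dictionary mapping the horizontal index of each node to the index of its governor (`None` for the root), the function names are hypothetical, and the two dependency analyses of the example sentences are assumed here, since the trees themselves are not given in the text. Punctuation is left out.

```python
def projection(u, head):
    """Cov(u, T): indices of all nodes from which a path leads to u (u included)."""
    cov = set()
    for v in head:
        node = v
        while node is not None:  # climb from v towards the root
            if node == u:
                cov.add(v)
                break
            node = head[node]
    return cov


def holes(cov):
    """Number of neighbouring pairs (i_j, i_{j+1}) with i_{j+1} - i_j > 1."""
    ordered = sorted(cov)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > 1)


def d_nh(head):
    """dNh(T): the maximum number of holes over all nodes of the tree."""
    return max(holes(projection(u, head)) for u in head)


# Assumed analysis of 'Nepodařilo(1) se(2) otevřít(3) zadaný(4) soubor(5)'.
projective_tree = {1: None, 2: 1, 3: 1, 4: 5, 5: 3}

# Assumed analysis of 'Zadaný(1) soubor(2) se(3) nepodařilo(4) otevřít(5)'.
nonprojective_tree = {1: 2, 2: 5, 3: 4, 4: None, 5: 4}
```

Under the assumed analyses, the projection of 'otevřít' in the second sentence is {1, 2, 5}, so the pair (2, 5) creates a hole and the tree is nonprojective, while no node of the first tree has a hole in its projection.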

In our previous papers [6] we have defined a class of formal grammars called RFODG 
(Robust Free-Order Dependency Grammars). It was developed as a tool for the descrip- 
tion of sentences of a language with high degree of word-order freedom and for the 
differentiation of syntactically well-formed and ill-formed sentences. For this purpose it 
uses two kinds of classification of the set of symbols - not only the traditional one (ter- 
minals and nonterminals), but also the classification into positive and negative symbols, 
which serve for the localization of syntactic inconsistencies (errors).

For the sake of the following explanations it is useful to introduce the notion of a
DR-tree (delete-rewrite-tree) according to an RFODG grammar G. A DR-tree maps the
essential part of the history of deleting dependent symbols and rewriting dominant symbols,
performed by the rules applied. The definition of a DR-tree was introduced in [3].

Put informally, a DR-tree (created by a RFODG G) is a finite tree with a root and 
with the following two types of edges: 

a) vertical (V-edges): these edges correspond to the rewriting of the dominant symbol
by the symbol which is on the left-hand side of the rule (of G) used. The vertical edge 
leads (is oriented) from the node containing the original dominant symbol to the node 
containing the symbol from the left-hand side of the rule used. 

b) oblique: these edges correspond to the deletion of a dependent symbol. Any such edge 
is oriented from the node with the dependent deleted symbol to the node containing the 
symbol from the left-hand side of the rule used. 

It is possible to define notions analogous to those for the D-tree, namely the notions of
DR-projection and DR-projectivity (instead of D-projection and D-projectivity).
Definition: TN(G) denotes the set of complete DR-trees rooted in a symbol from St,
created by G. If Tr ∈ TN(G), we say that Tr is parsed by G. Let $w = a_1 a_2 \ldots a_n$,
$w \in T^*$, Tr ∈ TN(G), and let the i-th leaf of Tr contain the symbol $a_i$ for i = 1, ..., n.
In such a case we say that the string w is parsed into Tr by G. The symbol L(G) represents
the set of strings (sentences) parsed into some DR-tree from TN(G). We will also write
TN(w, G) = {Tr; w is parsed into Tr by G}.

5 A Relationship between DR-Trees and D-Trees

There is a very straightforward correspondence between DR-trees and D-trees. Infor- 
mally, a D-tree is obtained by the contraction of vertical paths of a DR-tree and by the 
addition of information about the distance of a particular node from the root (vertical 
index). The symbol belonging to the node of the D-tree is taken from the node of the 
DR-tree with identical horizontal index and with the maximal distance from the root. 

Generally, there may be a difference in the non-projectivity measure for DR-trees
and D-trees. The non-projectivity measure of a D-tree may have a lower value than the
measure for the DR-tree from which the particular D-tree was contracted. The degree of
robustness (cf. the following definition) is identical for both types of trees. The actual
process of testing and debugging the metagrammar is done for D-trees. This
decision in fact substantially simplifies the metagrammar, because it is not necessary to
take care of the order of application of individual metarules.

There is one more measure, which is important for the description of structures 
obtained as a result of robust parsing and for the formulation of global constraints on 
parsing. This measure describes the degree of “ill-formedness” of a particular tree. 

For this purpose we will need slightly modified D-trees. The original definition of
D-trees does not allow keeping track of syntactic inconsistencies (errors) encountered in
the parsing process. There is a very natural way to add this information to standard
D-trees, namely by recording the location of the syntactic inconsistencies (errors) in
the trees.

Definition: Let $Tr \in TN(w, G)$ be a robust DR-tree over a sentence $w$. A node
of $Tr$ is negative if it contains a negative symbol (nonterminal). We denote by $Rob(Tr)$
the number of negative nodes in $Tr$. $Rob(Tr)$ is called the degree of robustness of $Tr$.
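
The definition can be illustrated by a minimal sketch in Python. The names Node and rob
are hypothetical and introduced here only for illustration; they are not taken from the
parser described in [7]. The sketch assumes that every node already carries a flag saying
whether its symbol is negative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; `negative` marks a negative symbol (assumed to be given)."""
    symbol: str
    negative: bool = False
    children: List["Node"] = field(default_factory=list)


def rob(tree: Node) -> int:
    """Degree of robustness: the number of negative nodes in the tree."""
    return int(tree.negative) + sum(rob(child) for child in tree.children)


# Toy tree with two negative nodes (the symbols are invented for the example).
example = Node("S", children=[
    Node("NP", negative=True),
    Node("VP", children=[Node("V"), Node("ERR", negative=True)]),
])
print(rob(example))
```

A tree with a degree of robustness of zero contains no negative node, i.e. it was parsed
positively.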



6 The Mutual Relationship of Nonprojectivity Measures 
and the Degree of Robustness 

As we have already mentioned in previous chapters, we understand the robust parser as 
a tool for assigning a syntactic structure(s) to both ill-formed and well-formed sentences, 
which do not violate certain global constraints and also as a tool for filtering out the 
sentences which violate them. 

The measures we have defined in the previous section represent one of the possible 
global constraints which may be used for the robust parser. The degree of robustness 
and the nonprojectivity measures capture the complexity of sentences relevant from the 
point of view of a robust parsing of languages with a high degree of word-order freedom. 
They provide a theoretical base for the division of a parsing process into phases, whose 
aim is to analyze input sentences step by step, with more relaxed global constraints for 
each subsequent phase. 

The way in which the scale of global constraints imposed on a particular metagrammar
may be used can be demonstrated on the existing implementation of our robust parser
(cf. [7]). We have applied the global constraints on nonprojectivity and on the degree of
robustness in a slightly reduced manner: we did not work with the full scale of constraint
relaxation. This allowed us to divide the parsing process into four phases.

The first phase, called the positive DR-projective phase, tests whether the set of
DR-projective positively parsed trees is empty. If it is not, the sentence is syntactically
correct and the parser issues all DR-projective trees representing the syntactic structures
of all syntactically acceptable readings of the sentence.

If the sentence is not DR-projective or if it is syntactically ill-formed, the second
phase, called positive D-projective, is applied. It is similar to the first phase. The
difference is the possibility of deriving non-projective DR-trees, which are contracted into
projective D-trees (D-projectivity generally allows more trees to be derived than
DR-projectivity; D-projective analysis is more powerful than DR-projective analysis, cf. [3]).

When the second phase is also unsuccessful, the third phase, called negative D-projective
or positive nonprojective, starts. It tries to find either a positively parsed nonprojective
tree or a D-projective tree with the degree of robustness equal to or greater than one, i.e.
a D-projective tree representing an ill-formed sentence. If the parser succeeds, it stops
and issues all relevant trees as the result of parsing.

If the third phase fails, the partial results are handed over to the fourth phase, which
is called negative nonprojective. It tries to find a nonprojective tree with the degree of
robustness equal to or greater than zero. This phase is the last one and if it fails, the whole
parser fails and issues a relevant error message. A much more subtle scale of phases is
currently being developed by Tomáš Holan [5].

A combination of negative projective and positive nonprojective parsing in the third
phase is quite natural. In some cases it is very difficult or even impossible to find out
whether the sentence is correct and nonprojective or ill-formed and projective.
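
The four phases can be summarised by the following sketch in Python. It is only an
illustration under simplifying assumptions: the real parser builds trees under the
constraints of each phase, whereas the sketch filters a list of candidate trees that are
assumed to be already available. The names Tree, PHASES and parse_in_phases are
hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Tree:
    """Summary of one candidate tree (all fields are assumed to be precomputed)."""
    name: str
    dr_projective: bool   # the DR-tree is projective
    d_projective: bool    # the contracted D-tree is projective
    rob: int              # degree of robustness (number of negative nodes)


# Phases are ordered from the strictest global constraints to the most relaxed ones.
PHASES: List[Tuple[str, Callable[[Tree], bool]]] = [
    ("positive DR-projective", lambda t: t.rob == 0 and t.dr_projective),
    ("positive D-projective", lambda t: t.rob == 0 and t.d_projective),
    # Either positively parsed (any projectivity) or D-projective with rob >= 1.
    ("negative D-projective or positive nonprojective",
     lambda t: t.rob == 0 or t.d_projective),
    ("negative nonprojective", lambda t: True),
]


def parse_in_phases(candidates: List[Tree]) -> Tuple[Optional[str], List[Tree]]:
    """Return the first phase that admits some tree, with all trees it admits."""
    for phase_name, admits in PHASES:
        accepted = [t for t in candidates if admits(t)]
        if accepted:
            return phase_name, accepted
    return None, []   # every phase failed: the parser reports an error


candidates = [
    Tree("t1", dr_projective=False, d_projective=True, rob=1),
    Tree("t2", dr_projective=False, d_projective=False, rob=0),
    Tree("t3", dr_projective=False, d_projective=False, rob=3),
]
phase, trees = parse_in_phases(candidates)
print(phase, [t.name for t in trees])
```

In this toy input the third phase cannot decide between an ill-formed projective reading
(t1) and a correct nonprojective one (t2), so both are issued.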

7 Conclusion 

This paper describes a method for developing a scale of global constraints
which allow a kind of gradual parsing of both syntactically well-formed and ill-formed
sentences of a natural language. This approach makes it possible to develop a more subtle
robust parser, reflecting the fact that for human readers there is a certain limit beyond
which it is very difficult or even impossible to assign a syntactic structure to an
ill-formed sentence. We neither claim nor think that the constraints on nonprojectivity and
robustness are the only constraints playing a role in this effort. We did not intend to
develop a complete system reflecting all the subtleties of natural languages; we rather
concentrated on the development of a method that allows several global and local constraints
of different types to be combined, and thus on preparing a flexible system which can be
tailored to the needs of a particular natural language.

Unlike the traditional approach, which typically concentrates on grammars describing
(and analyzing) syntactically well-formed sentences and usually defines a proper
subset of the set of syntactically correct sentences of a natural language (such a system
is hardly ever complete due to the existence of irregularities, exceptions and rare
phenomena in every natural language), we have extended our approach in the opposite
direction: our metagrammar is able to parse a kind of “robust closure” of a particular
natural language and thus it allows us to approach the borderline between the syntactically
correct and incorrect sentences from both the inside and the outside. In addition to this,
our method also allows shifting the borderline between parsable and unparsable sentences by
means of the application and relaxation of certain global constraints only, without the need
to modify the metagrammar. We believe that our method may also be easily adapted for
other natural languages, not only Czech. It is also clear that for languages of different
types it will probably be necessary to develop a different set of global constraints.

Abstract. This paper deals with the effective implementation of the new Czech
morphological analyser ajka, which is based on the algorithmic description of
Czech formal morphology. First, we present the two most important word-forming
processes in Czech, inflection and derivation. A brief description of the data
structures used for storing morphological information as well as a discussion of
the efficient storage of lexical items (stem bases of Czech words) is included
too. Finally, we present some interesting features of the designed and implemented
system ajka together with current statistical data.



1 Introduction 

Typically, morphological analysis returns the base form (lemma) and associates it with
all the possible POS (part-of-speech) labels together with all grammatical information
for each known word form. In analytical languages a simple approach can be taken: it is
enough to list all word forms to capture most morphological processes. In English,
for example, a regular verb usually has only 4 distinct forms, and irregular ones have at
most 8 forms. On the other hand, highly inflectional languages like Czech or Finnish
present a difficulty for such simple approaches, as the expansion of the dictionary is at
least an order of magnitude greater* [4]. Specialised finite-state compilers have been
implemented [1], which allow the use of specific operations for combining base forms
and affixes, and applying rules for morphophonological variations [3]. Descriptions of
morphological analysers for other languages can be found in [8,11].

Basically, there are three major types of word-forming processes: inflection,
derivation, and compounding. Inflection refers to the systematic modification of a stem by
means of prefixes and suffixes. Inflected forms express morphological distinctions like
case or number, but do not change meaning or POS. In contrast, the process of derivation
usually causes a change in meaning and often a change of POS. Compounding deals with
the process of merging several word bases to form a new word.

Czech belongs to the family of inflectional languages, which are characterised by the
fact that one morpheme (typically an ending) carries the values of several grammatical
categories together (for example, the ending of a noun typically expresses the values of
the grammatical categories of case, number and gender). This feature requires a special
treatment of Czech words in text processing systems. To this end, we developed a universal
morphological analyser which performs the morphological analysis by dividing
all words in Czech texts into their smallest relevant components, which we call segments.
The notion of a segment roughly corresponds to the linguistic concept of a morpheme, which
denotes the smallest meaningful unit of a language.

* As our effective implementation of a spell-checker for Czech based on finite state automata
suggests, it does not necessarily mean that no application can take advantage of a simple listing
of word forms in highly inflecting languages.

The presented morphological analyser consists of three major parts: a formal description
of morphological processes via morphological patterns, an assignment of Czech stems
to their relevant patterns, and a morphological analysis algorithm.

The description of Czech formal morphology is represented by a system of inflectional
patterns and sets of endings, and it includes lists of segments and their correct
combinations. The assignment of Czech stems to their patterns is contained in the Czech
Machine Dictionary [10]. Finally, the algorithm of morphological analysis uses this
information to split each word into appropriate segments.

The morphological analyser is being used for lemmatisation and morphological 
tagging of Czech texts in large corpora, as well as for generating correct word forms, 
and also as a spelling checker. It can also be applied to other problems that arise in 
the area of processing Czech texts, e.g. creating stop lists for building indexes used in 
information retrieval systems. 



2 Czech Inflectional Morphology 

The main part of the algorithmic description of formal morphology, as suggested
in [10], is the pattern definition. The basic notion is a morphological paradigm, a set of
all forms of an inflected word expressing a system of its respective grammatical categories.

As stated in [5], the traditional grammar of Czech suggests a much smaller paradigm
system than exists in reality. For this reason we decided to build a quite large set
of paradigm patterns from scratch to cover all variations of Czech. Fortunately, we
were not limited by technical restrictions,^ thus we could follow a straightforward
approach aiming at linguistic adequacy and a robust solution.

The detailed description of all variations in Czech paradigms enables us to define
application-dependent generalisations of the pattern system. For example, if we do not
need to take archaic word forms into consideration for a specific application, the number
of paradigm patterns (which reaches 1500 in the fully expanded form) can be automatically
reduced considerably.

Noun paradigms consist of word forms in particular cases of singular and plural. 
Verbs have more paradigms - for present tense, for imperative forms, etc. 

For example, the nouns hora (mountain), slza (tear) and řeka (river) display the
following forms in the paradigms:

Nom.   Gen.   Dat.    Acc.   Voc.   Loc.     Ins.
hora   hory   hoře    horu   horo   hoře     horou
hory   hor    horám   hory   hory   horách   horami
slza   slzy   slze    slzu   slzo   slze     slzou
slzy   slz    slzám   slzy   slzy   slzách   slzami
řeka   řeky   řece    řeku   řeko   řece     řekou
řeky   řek    řekám   řeky   řeky   řekách   řekami

^ Hajič [5], e.g., indicates that his system is limited to 214 paradigm patterns.

As we can see, the corresponding word forms in the paradigms have the same ending.
That is why we can divide the given word form into two parts, a stem and an ending.
We then obtain the following segmentation:

hor-{a,y,u,o,ou} hoř-{e,e}
hor-{y,_,ám,y,y,ách,ami}
slz-{a,y,e,u,o,e,ou}
slz-{y,_,ám,y,y,ách,ami}
řek-{a,y,u,o,ou} řec-{e,e}
řek-{y,_,ám,y,y,ách,ami}

In the paradigm for the word hora, there are two alternative stems (hor and hoř);
in the paradigm for the word slza, there is only one stem, slz; and in the paradigm for the
word řeka, there are again two alternative stems, řek and řec. We can also identify four
different ending sets: S1={a,y,u,o,ou}, S2={e,e}, S3={y,_,ám,y,y,ách,ami}
and S4={a,y,e,u,o,e,ou}, but it is clear that S4=S1+S2.

This observation leads us to a system of ending sets. We distinguish between
two types of sets, basic and peripheral ending sets. The basic ones contain endings that
have no influence on the form of the stem, while endings from the peripheral ending sets
cause changes in the stem. In our case, the sets S1 and S3 are basic and S2 is peripheral,
because the ending e causes an alternation of the last letter r to ř in the stem hor
and, similarly, of k to c in the stem řek.

Moreover, we can put all the endings from the sets S1 and S3 into one (newly created)
set, say S5, because they are basic and are common to the stems hor, slz and řek. Now we
can write the previous paradigms concisely in the following way:

hor-S5 hoř-S2
slz-S5+S2
řek-S5 řec-S2

Every ending carries values of grammatical categories of the relevant word form.
For example, all endings in the previously defined sets are characterised as endings of nouns
in feminine gender. Endings from the sets S1 and S2 originate from the singular paradigm,
the others (from S3) express plural. Thus, the set S5 now includes endings from both
the singular and the plural paradigm, and this information must be preserved and stored in
the system, so we decided to use the following data structure for storing ending sets:

S5=[1FS.] (a,1) (y,2) (u,4) (o,5) (ou,7)
   [1FP.] (y,1) (_,2) (ám,3) (y,4) (y,5) (ách,6) (ami,7)
S2=[1FS.] (e,3) (e,6)

Since endings, grammatical tags and values of grammatical categories repeat in the 
definitions of sets, we store them in unique tables and use references to these tables in 
the definitions of sets. 
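
A minimal sketch in Python may make the structure of ending sets more concrete. It is an
illustration only, not the storage format used in ajka: the names ENDING_SETS and readings
are hypothetical, the tag strings "1FS." and "1FP." are our reading of the notation above
(feminine noun, singular and plural), and the values are stored directly rather than as
references to shared tables.

```python
from typing import Dict, List, Tuple

# Each ending set is a list of (tag, [(ending, case number), ...]) groups.
# The empty string stands for the empty ending written "_" in the text.
EndingSet = List[Tuple[str, List[Tuple[str, int]]]]

ENDING_SETS: Dict[str, EndingSet] = {
    "S5": [
        ("1FS.", [("a", 1), ("y", 2), ("u", 4), ("o", 5), ("ou", 7)]),
        ("1FP.", [("y", 1), ("", 2), ("ám", 3), ("y", 4), ("y", 5),
                  ("ách", 6), ("ami", 7)]),
    ],
    "S2": [
        ("1FS.", [("e", 3), ("e", 6)]),
    ],
}


def readings(set_id: str, ending: str) -> List[Tuple[str, int]]:
    """All (tag, case number) pairs that `ending` can express within a set."""
    return [(tag, case)
            for tag, pairs in ENDING_SETS[set_id]
            for candidate, case in pairs
            if candidate == ending]


print(readings("S5", "y"))   # the ending y is ambiguous within S5
print(readings("S2", "e"))   # dative and locative singular
```

The lookup shows why the tag and the case number have to be kept with every ending: the
same ending string may express several combinations of grammatical values.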

In the next step, we perform a further segmentation of stems into a stem base and an
intersegment. The stem base is the part that is common to all word forms in the paradigm
(it does not change) and the intersegment is the final group of letters of the stem, whose
form changes. We obtain the following segmentation of stems:

ho-{r,ř}
ře-{k,c}

Since the stems hor and řek can be followed by the endings from the set S5, while
the stems hoř and řec can be followed by the endings from the set S2, and the stem slz
accepts endings from both S5 and S2, we have to store the information about the only
possible combinations of stem bases, intersegments and endings in our system in the form of
a pattern definition. To this end, we use the following data structure:

hora+<r>S5<ř>S2
slza+<_>S5,S2
řeka+<k>S5<c>S2

A pattern is denoted by its unique identifier (hora, slza, řeka) and it consists of blocks
that are prefixed with an intersegment visually enclosed in <>. The special character _
stands for an empty intersegment. Each block then contains a list of identifiers of sets.
Identifiers of sets are visually separated by a comma in lists. Again, since intersegments
and lists of identifiers repeat in the definitions of patterns, we store them in unique tables
and use references to these tables in the definitions.
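
The way a pattern definition restricts the combinations of stem bases, intersegments and
endings can be sketched in Python as follows. This is a simplified illustration, not the
algorithm implemented in ajka: the names ENDINGS, PATTERNS, STEM_BASES and analyse are
hypothetical, the three-entry dictionary stands in for the Czech Machine Dictionary, and
grammatical tags are left out.

```python
from typing import Dict, List, Set, Tuple

# Endings only (tags omitted); "" is the empty ending written "_" in the text.
ENDINGS: Dict[str, Set[str]] = {
    "S5": {"a", "y", "u", "o", "ou", "", "ám", "ách", "ami"},
    "S2": {"e"},
}

# pattern id -> blocks of (intersegment, identifiers of the allowed ending sets)
PATTERNS: Dict[str, List[Tuple[str, List[str]]]] = {
    "hora": [("r", ["S5"]), ("ř", ["S2"])],
    "slza": [("", ["S5", "S2"])],
    "řeka": [("k", ["S5"]), ("c", ["S2"])],
}

# stem base -> pattern id (a toy stand-in for the dictionary)
STEM_BASES: Dict[str, str] = {"ho": "hora", "slz": "slza", "ře": "řeka"}


def analyse(word: str) -> List[Tuple[str, str, str, str]]:
    """Return all (stem base, intersegment, ending, set id) segmentations."""
    results = []
    for i in range(1, len(word) + 1):
        base, rest = word[:i], word[i:]
        pattern = STEM_BASES.get(base)
        if pattern is None:
            continue
        for intersegment, set_ids in PATTERNS[pattern]:
            if not rest.startswith(intersegment):
                continue
            ending = rest[len(intersegment):]
            for set_id in set_ids:
                if ending in ENDINGS[set_id]:
                    results.append((base, intersegment, ending, set_id))
    return results


print(analyse("hoře"))    # accepted: ho + ř + e
print(analyse("hore"))    # rejected: the ending e is not allowed after ho + r
```

The second call returns an empty list because the block <r> of the pattern hora refers
only to the set S5, which does not contain the ending e.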



3 Derivative Processes 

As has been shown in the previous section, the morphological process of inflection is
captured by means of paradigms in our system. Compounding does not play a crucial role
in Czech morphology if compared with other languages, e.g. German [7]. What remains
untouched so far is the description of derivative processes. It will be discussed in
this section.

The process of morphological derivation of new words, primarily with distinct POS
categories, is considered a higher degree of morphological process, on the level above
inflection. Indeed, for example, a particular class of deverbative adjectives can be
derived from the derivation paradigm of transitive verbs. A hierarchical system of
morphological paradigms has been implemented as a tool able to capture different levels of
Czech morphology.

Hierarchical patterns are constructed fully automatically from the binding defined
on the level of basic forms, which always connects one lemma with another one by a specified
type of link. If a process could be described as an n-ary relation, it would be partitioned
into n - 1 binary relations. This partitioning is much more flexible and allows automatic
generalisations of derivation relations. To demonstrate the derivation binding on the level
of lemmata, we present the following example with participles:

počítat--(DEVESUBST)--počítání           count--counting
počítat--(DEVEADJPAS)--počítaný          count--counted
počítat--(DEVEADJPASSHORT)--počítán      count--is counted
počítat--(DEVEADJACTIMPF)--počítající    count--is counting

From the given example it follows that each link connects one base form of a word
with another one and names the relation between them. If the label of a base form is
unambiguous and can therefore be used as a primary identifier, it is sufficient to specify
only these labels in the binding process. If the label itself can be ambiguous, the pairs of
a lemma and the relevant inflectional pattern are connected. However, even this approach is
not able to represent completely the dependency of the relation on a particular sense of a
word. For example, the relation “possessive adjective” applies only to the reading of jeřáb
denoting a bird. This is the reason why we have implemented a system connecting
pairs of triplets (sense-id, lemma, paradigm) by a named relation.

Indexing techniques and dictionary methods [6] used in our implementation allow an
efficient retrieval of related lemmata. It is also possible to quickly return a chosen base
form for a set of related words, a feature which is highly favourable in several applications,
e.g. in the area of information retrieval or indexing Internet documents.
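
The named links between lemmata can be pictured as a small labelled graph. The following
Python sketch is a simplification under stated assumptions: it connects plain lemmata
(written with Czech diacritics) instead of the triplets (sense-id, lemma, paradigm), it uses
ordinary dictionaries instead of the indexing techniques of [6], and the names LINKS,
related and base_form are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (source lemma, relation name, target lemma), taken from the example with participles
LINKS: List[Tuple[str, str, str]] = [
    ("počítat", "DEVESUBST", "počítání"),
    ("počítat", "DEVEADJPAS", "počítaný"),
    ("počítat", "DEVEADJPASSHORT", "počítán"),
    ("počítat", "DEVEADJACTIMPF", "počítající"),
]

FORWARD: Dict[Tuple[str, str], List[str]] = defaultdict(list)
BACKWARD: Dict[str, str] = {}
for source, relation, target in LINKS:
    FORWARD[(source, relation)].append(target)
    BACKWARD[target] = source


def related(lemma: str, relation: str) -> List[str]:
    """Lemmata reachable from `lemma` by one link with the given name."""
    return list(FORWARD.get((lemma, relation), []))


def base_form(lemma: str) -> str:
    """Follow links backwards to the lemma the word was derived from."""
    seen = {lemma}
    while lemma in BACKWARD and BACKWARD[lemma] not in seen:
        lemma = BACKWARD[lemma]
        seen.add(lemma)   # guards against cycles, e.g. among synonym links
    return lemma


print(related("počítat", "DEVESUBST"))
print(base_form("počítající"))
```

Mapping every derived word to one chosen base form in this way is what makes the links
useful for indexing: all four derived words above are indexed under počítat.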

The system of base form binding is not limited to the basic derivative processes
described above. The same principle is used, e.g., to capture two types of relations on the
level below basic derivation, namely original/adapted orthography and
inflectional/non-inflectional doublets in the case of loanwords. The former can be
demonstrated by the example of a link between gymnasium and gymnázium (in the current
version of our morphological analyser we use an even more elaborate assignment of these
doublet types in the form of a basic type of relation and a more specific subtype). An
example of an inflectional/non-inflectional doublet is the link between the word abbé
assigned to the paradigm abbé (non-inflectional) and to the paradigm Tony. It is of course
possible to model such relations on the basic level of inflectional paradigms as word-form
homonymy. However, this would lead to a mixture of unrelated forms and would complicate
special types of analyses, e.g. a style-checker analysis, that could be very interesting.

There are other relations connecting lemmata above the level of basic derivative
processes. We take advantage of the standard process and are able to uniformly describe
such different relations as diminutives (and their degree):

vůz--(DIMIN:1)--vozík
vůz--(DIMIN:2)--vozíček,

aspectual relations of verbs:

říci--(ASPPAIR)--říkat,

iterative relations of verbs (together with “degrees”):

chodit--(ITER:1)--chodívat
chodit--(ITER:2)--chodívávat,

the relations between an animate noun and a derived possessive adjective:

otec--(MASCPOSS)--otcův,

the process of creating feminine nouns from masculine ones:

soudce--(MASC2FEMI)--soudkyně,

or synonyms and antonyms:

kosmonaut--(SYNO)--astronaut
mladý--(ANTO)--starý.

The last class of links brings us directly to other relations that can be found in
semantic nets like Wordnet [9]. The typical relations of hyperonym/hyponym, part/whole
(meronyms) etc. are modelled on a higher level, the level based on synonyms, to be
able to link groups of synonyms (which are called synsets in the context of Wordnet).

The possibility of building complex structures of links, e.g. relations of relations, is
also employed in connecting roots of loanwords to their Czech equivalents. Similarly
to [12], we are therefore able to relate words derived from the Greek root kard with the
group of Czech words derived from the Czech root srd, e.g. osrdečník, kardiostimulátor,
srdce, kardiologie.

4 Czech Morphological Analyser ajka

The key point of a successful implementation of the analyser is an efficient storage
mechanism for lexical items. A trie structure is used for storing stem bases of Czech word
forms. One of the main disadvantages of this approach is its high memory requirements.
We tried to solve this problem by implementing the trie structure in the form of a
minimal finite state automaton. The incremental method of building such an automaton was
presented in [2] and is fast enough for our purpose. Moreover, the memory requirements
for storing the minimal automaton are significantly lower (see Table 1).
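
The memory saving can be illustrated on a toy example in Python. Note that the sketch does
not implement the incremental method of [2]: as a simpler stand-in it first builds a
complete trie and then merges equivalent subtrees, which gives the same minimal automaton
but needs the whole trie in memory. All names and the four sample words are chosen only for
illustration.

```python
from typing import Dict, Iterable, Tuple

Trie = Dict[str, "Trie"]


def build_trie(words: Iterable[str]) -> Trie:
    """Plain trie as nested dictionaries; the edge "$" marks the end of a word."""
    root: Trie = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def count_trie_nodes(node: Trie) -> int:
    """Number of nodes of the trie, including the root and end-marker nodes."""
    return 1 + sum(count_trie_nodes(child) for child in node.values())


def minimise(node: Trie, registry: Dict[Tuple, int]) -> int:
    """Return a state id for `node`; equivalent subtrees receive the same id."""
    key = tuple(sorted((ch, minimise(child, registry))
                       for ch, child in node.items()))
    if key not in registry:
        registry[key] = len(registry)
    return registry[key]


words = ["hora", "hory", "slza", "slzy"]
trie = build_trie(words)
states: Dict[Tuple, int] = {}
minimise(trie, states)
print(count_trie_nodes(trie), len(states))
```

The words share the suffixes a and y, so the automaton needs fewer states than the trie
has nodes; with a large dictionary of stem bases the effect is much stronger.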

There are two binary files that are essential for the analyser. One of them contains
definitions of sets of endings and morphological patterns stored in the data structures
described in Section 2. The source of this binary file is a text file with definitions of
ending sets and patterns. The second is a binary image of the Czech Machine Dictionary and
contains stem bases of Czech words and auxiliary data structures. We developed a program,
abin, that can read both of these text files and efficiently store their content in
appropriate data structures in the destination binary files.

The first action of the analyser is loading these binary files. These files are not
further processed; they are only loaded into memory. The main reason for this solution
is to allow as quick a start of the analyser as possible. The next actions of the analyser
are determined by the steps of the morphological analysis algorithm. The basic principle of
the algorithm is based on the segmentation described in Section 2. The separated ending
then determines the values of grammatical categories. More details can be found in [13].

Another feature of the analyser is the possibility of selecting various forms of the basic
word form (lemma).

Finally, the user can have several versions of the binary files that contain morphological
information and stem bases, and can specify which pair should be used by the analyser.
Users can take advantage of this feature to “switch on” the analysis of colloquial Czech,
domain-specific texts etc.

The power of the analyser can be evaluated by two features. The most important
one is the number of words that can be recognised by the analyser. This number depends
on the quality and richness of the dictionary. Our database contains 223,600 stem bases
and ajka is able to analyse (and generate) 5,678,122 correct Czech word forms. The
second feature is the speed of analysis. In the brief mode, ajka can analyse more than
20,000 words per second on a Pentium III processor with a frequency of 800 MHz. Some
other statistical data, such as the number of segments and the size of the binary files,
are shown in Table 1.

Table 1. Statistical data

#intersegments             779
#endings                   643
#sets of endings           2,806
#patterns                  1,570
#stem bases                223,600
#generated word forms      5,678,122
#generated tags            1,604
speed of the analysis      20,000 words/s
dictionary                 1,930,529 Bytes
morph. information         147,675 Bytes


5 Conclusion 

The morphological analyser ajka has been tested on large corpora containing 10® 
positions. Based on the test results, the definitions of sets of endings and patterns as 
well as the Czech Machine Dictionary are being extended by some missing, mostly 
foreign-language stem bases and their appropriate patterns and endings. In its current
state, ajka can be used for morphological analysis of any raw Czech texts.

The analyser ajka can readily be adapted to other inflectional languages that have
to deal with morphological analysis. In general, only the language-specific parts of the
system, i.e. the definitions of sets of endings and the dictionary, which are stored as text
files, have to be replaced for this purpose.

Abstract. The paper deals with the linguistic problem of fully automatic grouping
of semantically related words. We discuss the measures of semantic relatedness
of basic word forms and describe the treatment of collocations. Next we present
the procedure of hierarchical clustering of a very large number of semantically
related words and give examples of the resulting partitioning of data in the form
of a dendrogram. Finally we show a form of the output presentation that facilitates
the inspection of the resulting word clusters.



1 Introduction 

The task of automatically finding semantically related words belongs to the class of
automatic lexical acquisition problems that have attracted the attention of many researchers
in the area of natural language processing in recent decades [1,2,3]. The term “semantic
relatedness” denotes a large group of language phenomena ranging from specific phenomena like
synonyms, antonyms, hyperonyms, hyponyms, meronyms, etc. to more general ones, e.g.
sets of words used in a particular scientific field or domain. In this paper, the task is
understood in the wide sense.

The aim of finding groups of semantically related words is linguistically motivated
by the assumption that semantically related words behave similarly. Information about
the semantic relatedness of a particular group of words is valuable for humans: consider
for example foreign language learning and dictionaries arranged according to topics.
However, the strongest demand comes from the field of automatic natural language
processing, as it is one of the key issues in the solution of many problems in the field,
namely selectional preferences or restrictions on particular types of verb arguments,
word sense disambiguation, machine translation, document classification,
information retrieval and others.

The question remains how to cluster words according to the semantic domains or 
topics. The answer is motivated by the understanding of the task we have adopted above. 
Using the definition from [4] words are semantically similar (related) if they refer to 
entities in the world that are likely to co-occur. The simplest solution can therefore be 
based on the assumption that the words denoting such entities are also likely to co-occur 
in documents and it suffices to identify these words. 

The first problem encountered when applying this strategy is the frequent coincidence
of genuine semantic relatedness with collocations in the result. The topic of
collocation filtering is discussed in the following section.

The other problem concerns the fact that semantically related words do not need to
co-occur in the same document. For example, Manning and Schütze [4] present the terms
cosmonaut and astronaut as an example of words that are not similar in the document
space (they do not occur in the same documents) but are similar in the word space (they
occur with the same words).*

Automatic lexical acquisition has been thoroughly studied in the field of corpus
linguistics (see e.g. Boguraev and Pustejovsky [5]). The problem of semantic relatedness
has been approached from word co-occurrence statistics as well as from the syntactic
point of view [6]. There are also works on the automatic enhancement of semantic hierarchies
that can be viewed as a contribution to the solution of the semantic relatedness problem. The
standard reference for retrieving collocations from text is the work by Smadja [7].

The work most similar to ours is discussed in [4]. Manning and Schütze use the
logarithmic weighting function $f(x) = 1 + \log(x)$ for non-zero co-occurrence counts,
a 25-word context window and the cosine measure of semantic similarity. Unlike in our
experiments, they compiled only some 1,000 most frequent words as so-called focus
words and searched for about 20,000 most frequent words to form the word-by-word
matrix. Moreover, the experiment described in [4] was aimed at automatically finding
the words that were most similar to the selected focus words. In contrast, we
present a method for the automatic clustering of a huge number of frequently occurring
words according to their semantic relatedness.
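
For illustration, the weighting function and the cosine measure used in [4] can be written
down in a few lines of Python. The sketch assumes the natural logarithm (the base is not
specified here), and the co-occurrence counts are invented; it is not a reconstruction of
the experiments in [4].

```python
import math
from typing import Sequence


def weight(count: int) -> float:
    """Logarithmic weighting f(x) = 1 + log(x) for non-zero counts, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # a word with no co-occurrences is similar to nothing
    return dot / (norm_u * norm_v)


# Invented co-occurrence counts of two focus words with four context words.
cosmonaut = [weight(c) for c in (12, 0, 7, 3)]
astronaut = [weight(c) for c in (9, 1, 5, 0)]
print(round(cosine(cosmonaut, astronaut), 3))
```

The logarithm dampens the influence of very frequent context words, so that a single
high count does not dominate the similarity value.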



2 Prerequisites 

2.1 How to Measure Semantic Relatedness 

In the previous section we have defined the object of our interest - semantically related 
words - as words (not embodied in collocations) that are likely to co-occur within similar 
context. This section discusses how to characterize the fact that the words co-occur 
“frequently”. 

Several different methods have been applied to describe the notion of frequency.
Statistical tests that define the probability of the co-occurrence of events are the most
widely used. The t-test (or t-score), the closely related z-score, and Pearson’s χ²
(chi-square) test [8] belong to this category. The well-known likelihood ratio, which
moreover has the advantage of a clear interpretation, can also serve as a good
characterization for these purposes, especially in the case of sparse data. Besides these
statistically motivated measures we can apply the instrument of information theory, namely
MI, the (pointwise) mutual information, a measure that represents the amount of information
provided by the occurrence of one entity about the occurrence of the other entity [4].

It has been shown many times that none of these measures works very well for 
low-frequency entities. For this reason, we have to exclude the low-frequency events from our 

* It seems that the mentioned example does not work today in the time of world co-operation in 
space missions, and especially in the time of space partnership between Russians and Amer- 
icans, as can be demonstrated by the corpus sentence: A part of this project will be joined 
missions of Russian cosmonauts and American astronauts. Notwithstanding this fact, we retain 
this example for its illustrativeness. 
observation. We have defined pragmatically motivated thresholds for the minimal numbers 
of occurrences of the examined events. As we have dealt with a huge amount of corpus data 
(approximately 100 million words), this restriction is no considerable limitation. 
Moreover, the concentration on high-frequency events effaces the differences among 
various measures and decreases the dependency of the output quality on the choice of 
a particular measure. The mutual information measure used in our experiments gives 
results similar to those of other methods, and the problems with MI reported 
elsewhere [9] do not emerge. 

2.2 Context Definition 

The other important point in the definition of our goal is what we understand by the 
notion “co-occurrence in the context”. Context is straightforwardly defined in the area of 
information retrieval - it is given by means of documents. This approach is also applicable 
in the field of corpus linguistics, as the majority of corpora are partitioned into documents. 
However, the problem with the direct use of documents is the large variance in document 
size. There are corpora that limit the size of their documents, e.g. documents in the Brown 
corpus [10] contain 2000 words and then end at the sentence boundary even if it is 
inside a paragraph. On the other hand, corpora like the Bank of English, based on the 
motto “More is better”, throw out no text, and therefore the size of documents can range 
from short newspaper notices to whole books. 

Taking into account the large variance and all the possible problems with topic shift 
within one document, we decided to define the notion of context differently. We work 
with the context window <-N, N>, where N is the number of words on each side of 
the focus word. The context respects (does not cross) the document boundaries and ignores 
paragraph and sentence boundaries. The consequence of such a definition is the symmetry 
of the relatedness measure. 
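
The window definition above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments; the function name `cooccurrence_counts`, the representation of a corpus as a list of token lists, and the choice to count each unordered pair once per co-occurrence are assumptions made for illustration only.

```python
from collections import Counter


def cooccurrence_counts(documents, n=20):
    """Count unordered word pairs occurring within n tokens of each other.

    documents: iterable of token lists (one list per document). Windows are
    built inside a single token list, so they never cross a document boundary;
    sentence and paragraph boundaries are simply not represented.
    """
    pair_counts = Counter()
    for tokens in documents:
        for i, focus in enumerate(tokens):
            # Scanning only the right half of the window is enough: the left
            # half is covered when the earlier token is the focus word. This
            # is where the symmetry of the measure comes from.
            for other in tokens[i + 1 : i + 1 + n]:
                if other != focus:
                    pair_counts[tuple(sorted((focus, other)))] += 1
    return pair_counts
```

Because pairs are stored in sorted order, looking up (x, y) and (y, x) gives the same count, which mirrors the symmetry noted in the text.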

2.3 Finding Collocations 

Collocations of a given word are statements of the habitual or customary places of that 
word [11]. We have already mentioned the need to exclude collocations from our 
data so as not to contaminate clusters of semantically related words. We use the standard 
method of MI-score [12] to automatically identify the words that form a collocation. 
The only aspect of the process that is not routine is the extraction of collocations of three 
or more words. It is implemented as a sequential process of piecewise formation 
of (n+1)-word collocations from possible n-word collocations. Considering the huge 
amount of data we are dealing with (100-million-word corpora), it is obvious that the 
retrieval of multi-word collocations is time and resource demanding. (That is 
why we have used the capacity of a supercomputer.) 

A side effect of collocation identification is a partial solution of the word sense 
ambiguity problem. As our method does not employ soft clustering (see below), the 
process is forced to decide to which cluster an ambiguous word will be adjoined. With the 
collocation concept, a word forming a collocation can belong to one cluster as a part 
of one particular collocation and to another cluster as a part of another collocation. 
3 Arrangement of Experiments 

We have been experimenting with two different corpora - a large English corpus containing 
about 121 million words and a large Czech corpus containing about 120 million words (exact 
numbers can be found in Table 1). The data have been processed to exclude function words 
using a stop-list. 



Table 1. Size of corpora used in experiments 



# of              Czech      English

tokens      121,493,290  119,888,683
types         1,753,285      490,734
documents       329,977        4,124



The first step in the clustering process has been stemming or lemmatization (the 
assignment of the base form - lemma) of all word forms. The stemming algorithm for 
English can be implemented as a quite simple procedure which gives satisfactory results. 
On the other hand, lemmatization in a highly inflectional language like Czech needs 
a carefully designed morphological analyzer. This effort is compensated by the reduction 
of the number of items to be clustered (and therefore the time needed to process all data) and at the 
same time by the increase of occurrences of counted items and therefore by the increase 
of the statistical relevance of the obtained data. 

In order to eliminate singularities in statistics and to reduce the total number of the 
processed bigrams of words, we have restricted input data in several ways. The context 
of each word is taken as a window of 20 words on both sides. The minimal frequency of 
base forms has been set to 20 and the minimal frequency of bigrams to 5. Table 2 depicts 
exact values obtained from the Czech corpus. 

Table 2. Statistics obtained from the Czech corpus 



# of

different lemmata                          1,071,364
lemmata with frequency > 5                   218,956
lemmata with frequency > 20                   95,636
bigrams with frequency > 5                25,009,524
lemmata in bigrams with frequency > 5         72,311



The next task is to create lists of characteristic words for each context. The list of 
words sorted according to decreasing MI score is prepared for each word. The MI 
score is used only for this ordering; in the following steps the particular values of the 
score are not taken into consideration. The size of such lists is limited to 500 words. 
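
As an illustration of this step, the following Python sketch builds MI-ordered characteristic lists from co-occurrence counts. It assumes the usual pointwise definition MI(x, y) = log2(P(x, y) / (P(x) P(y))) with maximum-likelihood estimates obtained by dividing counts by a single total; the normalization actually used in the experiments is not stated, and the names `characteristic_lists`, `pair_counts`, `word_counts` and `total_words` are hypothetical.

```python
import math
from collections import defaultdict


def characteristic_lists(pair_counts, word_counts, total_words, max_len=500):
    """Return, for every word, its co-occurring words by decreasing MI.

    pair_counts: {(x, y): co-occurrence count}, each unordered pair stored once
    word_counts: {word: frequency}
    total_words: normalizing constant for all probabilities (an assumption)
    """
    scored = defaultdict(list)
    for (x, y), n_xy in pair_counts.items():
        # log2( (n_xy / N) / ((n_x / N) * (n_y / N)) ) simplifies to this:
        mi = math.log2(n_xy * total_words / (word_counts[x] * word_counts[y]))
        scored[x].append((mi, y))
        scored[y].append((mi, x))

    lists = {}
    for word, items in scored.items():
        # Decreasing MI; ties are broken alphabetically to keep runs stable.
        items.sort(key=lambda item: (-item[0], item[1]))
        # Only the order is kept, the MI values themselves are dropped.
        lists[word] = [other for _, other in items[:max_len]]
    return lists
```

The returned lists carry ranks only, in line with the remark that the particular MI values are not used in later steps.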

The calculation of the distance between two words is motivated by the observation that 
semantically related words have similar characteristic lists. The difference of ranks is 
computed for all the words from both lists, and the 10 smallest differences are summed to 
form the distance. 
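
A minimal Python sketch of such a rank-based distance follows. The text does not say how a word present in only one of the two lists is treated, so the sketch makes two labeled assumptions: only words shared by both lists contribute a rank difference, and if fewer than k words are shared, the sum is padded with a fixed penalty (`missing_penalty`, a hypothetical parameter set to the list size limit).

```python
def rank_distance(list_a, list_b, k=10, missing_penalty=500):
    """Sum of the k smallest rank differences between two characteristic lists.

    list_a, list_b: words ordered by decreasing MI score.
    """
    rank_a = {word: rank for rank, word in enumerate(list_a)}
    rank_b = {word: rank for rank, word in enumerate(list_b)}

    # Rank differences for the words that occur in both lists (assumption).
    shared = rank_a.keys() & rank_b.keys()
    diffs = sorted(abs(rank_a[w] - rank_b[w]) for w in shared)[:k]

    # Assumption: pairs with fewer than k shared words are penalized so that
    # they end up far apart instead of getting a deceptively small sum.
    diffs += [missing_penalty] * (k - len(diffs))
    return sum(diffs)
```

With this convention two identical lists of at least k words have distance 0, and two lists without any shared word have the maximal distance k * missing_penalty.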
4 Word Clustering 

Data clustering, also known as unsupervised classification, is a generic label for a variety 
of procedures designed to find natural groupings (clusters) in multidimensional data, 
based on measured or perceived similarities among the patterns [13]. Cluster analysis is 
a very important and useful technique which forms an active research topic. Hundreds 
of clustering algorithms that have been proposed in the literature can be divided into two 
basic groups - partitional clustering and hierarchical clustering. Partitional algorithms 
attempt to obtain a partition which minimizes the within-cluster scatter or maximizes the 
between-cluster scatter. Hierarchical techniques organize the data in a nested sequence 
of groups that can be displayed in the form of a dendrogram (tree) [14]. 

Partitional clustering techniques are used more frequently than hierarchical 
techniques in pattern recognition. However, we argue that the number of clusters in the data, 
their shapes and sizes, depend highly on the particular application that should benefit 
from the clustered data. As our aim is to find a clustering of a large vocabulary that 
could be used in many successive natural language tasks and for various applications, the 
hierarchical techniques give more flexible outputs with universal (more general) usage. 
The weak point of this decision is the need for a heuristic to cut the dendrogram to form 
the partition required by a particular application. 

The basic families of hierarchical clustering algorithms are single-link and 
complete-link algorithms. The former outputs a maximally connected subgraph, while the latter 
creates a maximally complete subgraph on the patterns. Complete-link clusters tend to 
be small and compact; single-link clusters, on the other hand, easily chain together [14]. 
For our experiment, we have implemented a single-link clustering algorithm. The 
computational cost of this algorithm is acceptable even for the enormous number of words we 
are working with, contrary to the complete-link clustering algorithm, which cannot directly 
benefit from the sorted list of distances and has to refresh the information about the 
distances each time a word is assigned to a cluster. The pseudo-code of the implemented 
algorithm can be seen in Figure 1. 
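
To make the single pass over the sorted list of distances concrete, here is a minimal Python sketch of single-link merging with a union-find structure and path compression. It is a reading of the idea, not the authors' code: the function name `single_link`, the input format (distance, id1, id2) and the interpretation of a "closed" cluster as one that is its own representative are assumptions.

```python
def single_link(sorted_pairs):
    """Single-link agglomeration over pairs sorted by increasing distance.

    sorted_pairs: iterable of (distance, id1, id2).
    Returns the merges as (distance, kept_root, absorbed_root), in the order
    in which they happen; this list is enough to draw a dendrogram.
    """
    parent = {}

    def find(x):
        # Follow parent links to the representative of x's cluster.
        parent.setdefault(x, x)
        path = []
        while parent[x] != x:
            path.append(x)
            x = parent[x]
        # Path compression: point every visited node directly at the root.
        for node in path:
            parent[node] = x
        return x

    merges = []
    for distance, id1, id2 in sorted_pairs:
        root1, root2 = find(id1), find(id2)
        if root1 != root2:
            # The closest pair spanning two clusters joins them (single link).
            parent[root2] = root1
            merges.append((distance, root1, root2))
    return merges
```

Since every pair is inspected once and cluster lookups are nearly constant time after compression, the cost is dominated by sorting the pairs, which matches the remark that single-link clustering benefits directly from the sorted list of distances.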

As stated above, the hierarchical clustering algorithms output the dendrogram - 
a special type of tree depicting the iterative adjoining of words and clusters. A small 
subset of the dendrogram resulting from our experiments with corpus data can be found 
in Figure 2. 

The final dendrogram has more than 40,000 nodes. As it is impossible to work with 
the whole tree in manual linguistic exploration of the results, we have implemented 
a simple procedure that, traversing the dendrogram, answers the question of how a word 
is related to another one. Each line written by this procedure corresponds to 
the link of the dendrogram leading to the smallest partition of words covering both focus 
words. An example output of the procedure applied to the words exhalation and cleanup is 
displayed in Figure 3. 



5 Conclusions and Future Work 

This paper presents a procedure of fully automatic finding of semantically related words. 
We have demonstrated that it is possible to work with large portions of text (100 million 
locateclust(id):
    Path ← ∅
    while clusters[id] not closed:
        Path ← Path ∪ {id}
        id ← clusters[id]
    foreach i ∈ Path:
        clusters[i] ← id
    return id

hierarchy():
    foreach <rank, id1, id2> ∈ sortbgr:
        c1 ← locateclust(id1)
        c2 ← locateclust(id2)
        if c1 ≠ c2:
            clusters[c2] ← c1
            hierarchy[c1] ← hierarchy[c1] ∪ {<c2, rank>}
            hierarchy[c2] ← hierarchy[c2] ∪ {<c1, 0>}
    return hierarchy



Fig. 1. Pseudo code of implemented clustering algorithm 




[Dendrogram with the nodes, from top to bottom: zamoření (contamination); kontaminovat (contaminate); znečistit (pollute); znečištění (pollution), vyčištění (cleanup); spad (fallout), výpar (exhalation), zamořit (contaminate), rozpouštědlo (solvent, noun)]



Fig. 2. An example of resulting dendrogram 



word corpora) and to find a hierarchical partitioning of all reasonably frequent words. 
It is precisely this enormous size of the input corpus that sets our approach apart from commonly 
used methods, which are applicable to toy problems only. The number of categorized words seems to be 
adequate for real applications, e.g. in the area of word sense disambiguation. 

The automatic evaluation of the whole result set of 40,000 basic word forms is 
not possible today as there are no domain oriented dictionaries covering a significant 
portion of the Czech language. However, the comparison of the resulting clustering in 
three particular domains (weather, finance and cookery) is in good agreement with the 
human linguistic intuition. 

The presence of polysemous and semantically ambiguous words poses an obstacle 
to any automatic word clustering. Our future effort will thus be focused on the correct 
výpar/exhalation 

kontaminovat/contaminate (zamoření/contamination spad/fallout) 

znečistit/pollute zamořit/contaminate 

znečištění/pollution 

rozpouštědlo/solvent (noun) 

vyčištění/cleanup 

Fig. 3. “Path” from word výpar/exhalation to vyčištění/cleanup 



treatment of these words. One of the possible solutions could be the incorporation of 
a mixture-decomposition clustering algorithm. This algorithm assumes that each 
classified pattern is drawn from one of the underlying populations (clusters) whose parameters 
are estimated from unlabelled data [15]. Mixture modeling allows soft membership that 
can be the answer to the semantic ambiguity problem. 

Another direction for future work will be to make the assessment of the quality of 
clustering results objective. At present, the only way to assess the quality of the implemented 
procedure’s output is manual checking of the results. We would like to employ 
information from different sources like machine-readable dictionaries, WordNet [16] 
and other semantic nets, and parallel corpora to rid the evaluation process of 
its subjective aspects. 
Abstract. This paper explores different strategies for extracting similarity 
relations between words from partially parsed text corpora. The strategies we have 
analysed require neither supervised training nor semantic information available 
from general lexical resources. They differ in the amount and the quality of the 
syntactic contexts against which words are compared. The paper presents in detail 
the notion of syntactic context and how syntactic information could be used to 
extract semantic regularities of word sequences. Finally, experimental tests with 
a Portuguese corpus demonstrate that similarity measures based on fine-grained and 
elaborate syntactic contexts perform better than those based on poorly defined 
contexts. 



1 Introduction 

The strategies for extracting semantic information from corpora can be roughly divided 
into two categories, knowledge-rich and knowledge-poor methods, according to the 
amount of knowledge they presuppose [4]. Knowledge-rich approaches require some 
sort of previously encoded semantic information [8,3,10]: domain-dependent 
knowledge structures, semantically tagged training corpora, and/or semantic resources such as 
handcrafted thesauri: Roget’s thesaurus, WordNet, and so on. Therefore, 
knowledge-rich approaches may inherit the main shortcomings and limitations of man-made lexical 
resources: limited vocabulary size, since they can include unnecessary general words, or 
do not include necessary domain-specific ones; unclear classification criteria, since their 
word classification is sometimes too coarse and does not provide sufficient distinction 
between words, or is sometimes unnecessarily detailed; and, obviously, the considerable 
time and effort required to build thesauri by hand. By contrast, knowledge-poor 
approaches use no presupposed semantic knowledge for automatically extracting semantic 
information. These techniques can be characterised as follows: no domain-specific 
information is available, no semantic tagging is used, and no static sources such as dictionaries or 
thesauri are required. They attempt to extract the frequency of co-occurrences of words 
within various contexts to compute semantic similarity among words. More precisely, 
the similarity measure takes into account the contexts that words share or do not share, 
as well as the importance of these contexts for each word. Words which share a great 
number of contexts are considered similar. 

* This work was supported by the PRAXIS XXI project, Fundação para a Ciência e a Tecnologia, 
Ministério da Ciência e a Tecnologia, Portugal. 
Since contexts can be defined in two different ways, two specific knowledge-poor 
strategies can also be distinguished: window-based and syntactic-based techniques. 
Window-based techniques consider an arbitrary number of words around a given word 
as forming its window, i.e., its context. The linguistic information about part-of-speech 
categories and syntactic groupings is not taken into account to characterise word 
contexts [1,11]. The syntactic-based strategy, on the contrary, requires specific linguistic 
information to specify the word context. First, it requires a part-of-speech tagger for 
assigning a morphosyntactic category to each word of the corpus. Then, the tagged 
corpus is segmented into a sequence of phrasal groupings (or chunks). Finally, simple 
attachment heuristics are used to specify the relations between and within the phrasal 
groupings. Once the syntactic analysis of the corpus is completed, each word in the 
corpus is associated with a set of syntactic contexts. Then, a statistical method compares the 
frequency of the shared contexts to judge word similarity [5,6,2]. In both strategies, 
window-based and syntactic-based techniques, words are compared to each other in 
terms of their contexts; yet, we consider that syntactic analysis opens up a much wider 
range of more precise contexts than does the simple window strategy. As syntactic contexts 
represent linguistic dependencies involving specific semantic relationships, they should 
be considered as fine-grained clues for identifying semantically related words. 

Since syntactic contexts can be defined in different ways, syntactic-based approaches 
can also be significantly different. Different pieces of linguistic information can be taken 
into account to characterise syntactic contexts.¹ For instance, the information used by Lin 
[6] to define the notion of syntactic context is not the same as that used by Grefenstette 
[5]. Nevertheless, the choice of a particular type of syntactic context for measuring word 
similarity has not been properly justified by those researchers. 

Thus, the main objective of this paper is to analyse the appropriateness or the 
inadequacy of different types of syntactic contexts for computing word similarity. More 
precisely, various syntactic-based strategies will be compared on the basis of different 
definitions of the notion of syntactic context. For this purpose, we apply these strategies 
to the Portuguese corpus P.G.R.² The article is organised as follows. In the next section, 
various types of syntactic contexts will be analysed. Special attention will be paid to 
the notion of syntactic context used by Grefenstette, as well as to the specific notion 
that we have defined. Then, in section 3.1, we will use the same statistical similarity 
measure to compare the appropriateness of the syntactic contexts defined in the previous 
section. The best results are obtained when the syntactic-based strategy relies on our 
notion of syntactic context. Samples of the results we have obtained are presented in the 
Appendix. 



2 Types of Syntactic Contexts 

In this section, we analyse the notion of syntactic context used by Grefenstette to compute 
similar words [5]. Then, we extract further syntactic information from the partially parsed 

¹ The choice of a particular measure of similarity may be another parameter for comparing various 
syntactic-based approaches. 

² P.G.R. (Portuguese General Attorney Opinions) is constituted by case-law documents 
(http://coluna.di.fct.unl.pt/~pgr). 
text in order to make syntactic contexts more elaborate. As a consequence, we obtain 
fine-grained contexts which contain more specific information than those provided by 
Grefenstette’s approach. 



2.1 The Notion of Attribute by Grefenstette 

Grefenstette calls “attributes” the syntactic contexts of a word. Attributes are extracted 
from binary syntactic dependencies between two words within a noun phrase or between 
the noun head and the verb head of two related phrases. A binary syntactic dependency 
could be noted: 

<R, w1, w2> 

where R denotes the syntactic relation itself (e.g., ADJ, NN, NNPREP, SUBJ, DOBJ, 
and IOBJ), and w1 and w2 represent two syntactically related words. Table 1 shows 
some syntactic dependencies between the noun “cause” and other related words. 



Table 1. Examples of syntactic dependencies 

Expressions                       Binary Dependencies

possible causes                   <ADJ, cause, possible>
the cause of neonatal jaundice    <NNPREP, cause, jaundice>
no cause could be determined      <DOBJ, determine, cause>
death cause                       <NN, cause, death>



Then, for each word found in the text, the system selects the words that are 
syntactically related to it. The syntactically related words are considered the attributes of the 
given word, i.e., its syntactic contexts. For instance, a noun can be syntactically related 
to an adjective by means of the ADJ relation, to another noun by means of the NN and 
NNPREP relations, or to a verb by means of the SUBJ, DOBJ, and IOBJ relations. These 
related words are taken to be the known attributes of the noun. 

In order to select the attributes of “cause”, the system takes as input all the binary 
dependencies between “cause” and other words. Then, it extracts all the specific words 
syntactically related to “cause”, since they represent its particular attributes. For example, 
from the 4 dependencies between “cause” and another word illustrated in Table 1, it is possible 
to extract 4 attributes of “cause” (see Table 2). 



Table 2. Attributes of cause 



Binary Dependencies           Attributes of cause

<ADJ, cause, possible>        <possible>
<NNPREP, cause, jaundice>     <jaundice>
<DOBJ, determine, cause>      <DOBJ, determine>
<NN, cause, death>            <death>
In Grefenstette’s notation, the attributes extracted from noun modifiers (namely 
NN, ADJ, and NNPREP modifiers) do not keep the name of the particular syntactic 
relation. So, <jaundice>, <possible>, and <death> are attributes of “cause” even though 
the syntactic relations NNPREP, ADJ and NN are not explicitly represented. When 
extracting verbal complements, though, the specific syntactic relation is still available: 
<DOBJ, determine> is a verbal attribute constituted by both the word related to “cause” 
(i.e. the verb “determine”) and the specific syntactic relation DOBJ. 
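
The extraction rule just described can be sketched in a few lines of Python. This is an illustrative reading of the text rather than Grefenstette's system: the function name `grefenstette_attributes` and the tuple encoding of dependencies are assumptions, and only attributes of nouns are covered, as in Table 2.

```python
# Relations inside a noun phrase; the relation name is dropped for these.
NOUN_MODIFIER_RELATIONS = {"ADJ", "NN", "NNPREP"}


def grefenstette_attributes(noun, dependencies):
    """Collect the attributes of a noun from <R, w1, w2> dependencies.

    dependencies: list of (relation, w1, w2) tuples. For noun-modifier
    relations w1 is the modified noun and w2 its modifier; for verbal
    relations (SUBJ, DOBJ, IOBJ) w1 is the verb and w2 the noun.
    """
    attributes = []
    for relation, w1, w2 in dependencies:
        if relation in NOUN_MODIFIER_RELATIONS:
            # Only the modified noun receives an attribute, and the attribute
            # is the bare modifier word without the relation name.
            if w1 == noun:
                attributes.append((w2,))
        elif w2 == noun:
            # Verbal complements keep both the relation and the verb.
            attributes.append((relation, w1))
    return attributes
```

Applied to the four dependencies of Table 1, the function returns the four attributes of “cause” listed in Table 2, while a modifier such as “jaundice” receives no attribute at all from the NNPREP dependency.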

2.2 Underspecified Attributes 

The notion of attribute defined in the previous section does not inherit all the available 
syntactic information from binary dependencies. Consider one of the Portuguese 
expressions found in our corpus: “autorização à empresa” (permission to the company). From 
this expression, <empresa> (company) is extracted as the attribute of “autorização” 
(permission). Yet, relevant information implicitly contained in the dependency relation 
has been lost: 

- information about the specific preposition: the attribute <empresa> does not convey 
information about the particular preposition “a” relating the two words; 

- information about the opposite attribute: the attribute <autorização> modifying the 
word “empresa” is not considered. 

Information about prepositions should be taken into account since they convey 
important syntactic and semantic information. Let’s consider two prepositional expressions: 
“autorização à empresa” (permission to the company) and “autorização da empresa” 
(authorization by the company). According to Grefenstette’s notion of attribute, we 
should extract the same attribute, namely <empresa>, from both expressions. 
Nevertheless, the preposition “a” (to) introduces a quite different syntactic dependency than the 
one introduced by the preposition “de” (by). Whereas the preposition “a” requires 
<empresa> to be the receiver within the action of giving authorization, the preposition “de” 
requires <empresa> to be the agent of this action. Therefore, for the purpose of 
extracting semantic regularities, prepositions should be considered as internal facets of 
syntactic contexts. 

From Grefenstette’s viewpoint, only one attribute, <empresa>, could be 
extracted from the NNPREP expression “autorização à empresa”. Whereas the modifier 
word (i.e., the noun after the preposition) is considered as a potential attribute, the 
modified word (i.e., the noun before the preposition) cannot become an attribute. The tests 
introduced in section 3.1 will show that Grefenstette’s notion of attribute is too 
restrictive for the purpose of measuring word similarity. 

2.3 More Accurate Attributes for Word Similarity Measurement 

In order to take into account the implicit information contained in the dependency rela- 
tionships, we will introduce a more general and flexible definition of attribute. The results 
of the computational tests presented in the next section will provide us with empirical 
evidence about the appropriateness of such a definition. 
Attributes are extracted from binary syntactic dependencies. A syntactic dependency 
may be represented as the following binary predication: 

(r; w1↓, w2↑) 

This binary predication is constituted by the following entities: 

- the binary predicate r, which can be associated with specific prepositions, subject 
relations, direct object relations, etc.; 

- the roles of the predicate, ↓ and ↑, which represent the modified and modifier 
roles, respectively; 

- the two words holding the binary relation: w1 and w2. 

In this binary syntactic dependency, the word indexed by ↓ plays the role of modified, 
whereas the word indexed by ↑ plays the role of modifier. Therefore, w1 is modified by 
w2, and w2 modifies w1. This way, two complementary attributes may be extracted 
from that syntactic dependency: 

<↓r, w1>   <↑r, w2> 

where <↓r, w1> is the attribute of w2 and <↑r, w2> is the attribute of w1. An 
attribute is defined as the pair constituted by both a specific syntactic function and the 
word associated with this function. In particular, ↓r represents the syntactic function of 
modified, and ↑r the modifier function. Consider Table 3. The left column contains 
expressions constituted by two words syntactically related by a particular type of syntactic 
dependency. The right column contains the attributes extracted from these expressions. 
For instance, from the expression “autorização à empresa”, two attributes were extracted: 
<↑a, empresa>, where “empresa” plays the role of modifier word, and 
<↓a, autorização>, where “autorização” is the modified word. Furthermore, 
information about the specific preposition connecting the words is also available. Our 
notion of attribute is closely related to what Lin calls “feature” [6]. 



Table 3. Elaborate attributes 



Binary Expressions                      Attributes

autorização à empresa                   <↑a, empresa>, <↓a, autorização>
(permission to the company)

nomeação do presidente                  <↑de, presidente>, <↓de, nomeação>
(appointment of the president)

nomeou o presidente                     <↑dobj, presidente>, <↓dobj, nomear>
(appointed the president)

discutiu sobre a nomeação               <↑sobre, nomeação>, <↓sobre, discutir>
(discussed about the appointment)



These elaborate attributes provide us with fine-grained syntactic contexts. In the 
following section, we will compare these informative syntactic contexts to the coarse- 
grained contexts used by Grefenstette. This will lead us to assume that the elaborate 
information conveyed by our notion of attribute is able to contribute more accurately to 
design a suitable strategy for clustering similar words. 
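
The two complementary attributes per dependency can be made concrete with a short Python sketch. The encoding is an assumption made for illustration: roles are written as the strings "up" (modifier function) and "down" (modified function), an attribute is a (role, relation, word) triple, and the name `collect_attributes` is hypothetical.

```python
from collections import defaultdict


def collect_attributes(dependencies):
    """Extract both attributes from every (relation, modified, modifier) triple.

    Returns {word: set of (role, relation, other_word)}.
    """
    attributes = defaultdict(set)
    for relation, modified, modifier in dependencies:
        # The modified word w1 receives <up r, w2>.
        attributes[modified].add(("up", relation, modifier))
        # The modifier word w2 receives the opposite attribute <down r, w1>.
        attributes[modifier].add(("down", relation, modified))
    return attributes
```

Unlike the coarse-grained notion, this keeps the preposition inside the attribute, so dependencies that differ only in the preposition (for example "a" versus "de") yield different attributes for the same pair of words, and the modified noun also becomes an attribute of its modifier.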
3 Comparing Syntactic-Based Strategies 

Various semantic extraction techniques were applied to the Portuguese corpus P.G.R. 
(Portuguese General Attorney Opinions), which is constituted by 1,678,593 word 
occurrences. These techniques use the output obtained from two pre-processing 
steps. First, the corpus was tagged by the part-of-speech tagger presented in [7]. Then, 
it was analysed by the partial parser presented in [9]. Similarity was computed by 
measuring the syntactic information shared by 4,276 different nouns on the basis of 186,952 
different attributes. 



3.1 The Weighted Jaccard Similarity Measure 



To compare the syntactic contexts of two words, we used as similarity measure a weighted 
version of the binary Jaccard measure [5].³ The binary Jaccard measure, noted BJ, 
calculates the similarity value between two words, m and n, by comparing the attributes 
they share and do not share: 

\[
BJ(word_m, word_n) = \frac{|\{word_m\ \mathrm{attributes}\} \cap \{word_n\ \mathrm{attributes}\}|}{|\{word_m\ \mathrm{attributes}\} \cup \{word_n\ \mathrm{attributes}\}|}
\]



The weighted Jaccard measure considers a global and a local weight for each attribute. 
The global weight gw takes into account how many different words are associated with 
a given attribute. It is computed by the following formula: 

\[
gw(attribute_j) = 1 - \sum_i \frac{|p_{ij} \log_2(p_{ij})|}{\log_2(nrels)}
\]

where 

\[
p_{ij} = \frac{\text{frequency of } attribute_j \text{ with } word_i}{\text{total number of attributes for } word_i}
\]

and nrels is the total number of relations extracted from the corpus. The local weight 
lw is based on the frequency of the attribute with a given word, and it is calculated by: 

\[
lw(word_i, attribute_j) = \log_2(\text{frequency of } attribute_j \text{ with } word_i)
\]



The whole weight w of an attribute is the product of the global and the 
local weights. So, the weighted Jaccard similarity WJ between two words m and n is 
computed by: 

\[
WJ(word_m, word_n) = \frac{\sum_j \min(w(word_m, attribute_j), w(word_n, attribute_j))}{\sum_j \max(w(word_m, attribute_j), w(word_n, attribute_j))}
\]
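
The weighting scheme can be followed step by step in the Python sketch below. It is a sketch under stated assumptions, not the authors' code: attribute frequencies are held in a nested dictionary `freq[word][attribute]`, "total number of attributes for a word" is read as the sum of that word's attribute frequencies, an attribute never seen with a word gets local weight 0, and the sums in WJ run over the attributes of either word. All function names are hypothetical.

```python
import math


def global_weight(attribute, freq, nrels):
    """gw = 1 - sum_i |p_ij * log2(p_ij)| / log2(nrels)."""
    total = 0.0
    for attrs in freq.values():
        f = attrs.get(attribute, 0)
        if f:
            # p_ij: share of this attribute among all attributes of the word.
            p = f / sum(attrs.values())
            total += abs(p * math.log2(p))
    return 1.0 - total / math.log2(nrels)


def local_weight(word, attribute, freq):
    """lw = log2(frequency of the attribute with the word), 0 if unseen."""
    f = freq[word].get(attribute, 0)
    return math.log2(f) if f else 0.0


def weighted_jaccard(word_m, word_n, freq, nrels):
    """WJ = sum_j min(w_mj, w_nj) / sum_j max(w_mj, w_nj), with w = gw * lw."""
    numerator = denominator = 0.0
    for attribute in set(freq[word_m]) | set(freq[word_n]):
        gw = global_weight(attribute, freq, nrels)
        w_m = gw * local_weight(word_m, attribute, freq)
        w_n = gw * local_weight(word_n, attribute, freq)
        numerator += min(w_m, w_n)
        denominator += max(w_m, w_n)
    return numerator / denominator if denominator else 0.0
```

Note that, by the formula for lw, an attribute seen exactly once with a word has local weight log2(1) = 0 and therefore does not contribute to the similarity.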



By computing the similarity measure of all word pairs in the corpus, we extracted 
the list of the most similar words to each word in the corpus. This process was repeated 

³ We implemented various statistical measures: coefficient of Jaccard, a specific version of the 
weighted Jaccard, and the particular coefficient of Lin. They did not improve, though, the results 
obtained from the weighted Jaccard measure described in this section. 
considering different types of syntactic contexts. On the one hand, we tested the 
relevance of prepositional information for the definition of attributes. For this 
purpose, we compared the results obtained from two strategies: “+prep-strategy” and 
“-prep-strategy”. The former uses attributes containing information about the specific 
prepositions, while the latter does not use that information. On the other hand, we tested 
the adequacy of the “↓-attributes” extracted from prepositional dependencies between 
two noun phrases. For this purpose, we also compared two different methods: 
“↑↓-strategy” and “↑-strategy”. The former contains both types of attributes, while the latter 
uses only ↑-attributes. 

3.2 +prep-strategy versus -prep-strategy 

We first tested the contribution of the specific prepositions to measuring word similarity. 
The results obtained from both strategies, +prep-strategy and -prep-strategy, showed 
that there are no great differences for the words sharing a great number of attributes 
(namely, more than 100 attributes). That is, the results are not significantly different 
for words frequently appearing in the corpus. Table 4 compares the results obtained from 
the +prep-strategy to those obtained from the -prep-strategy, retaining at random 
some of the words sharing more than 100 attributes.⁴ 



Table 4. Similarity lists of frequently appearing words (> 100 attributes) produced by contexts 
with and without prepositional information 



Word            Cluster of similar words
                +prep-strategy                                            -prep-strategy

presidente      secretário, membro, director, provedor                    director, secretário, procurador, membro
(president)     (secretary, member, director, purveyor)                   (director, secretary, attorney, member)

lei             artigo, decreto, diploma, norma                           artigo, decreto, n, norma
(law)           (article, decree, diploma, norm)                          (article, decree, n, norm)

estado          administração, ministério, pessoa, governo                ministério, administração, governo, tribunal
(state)         (administration, state department, person, government)    (state department, administration, government, tribunal)

diploma         decreto, lei, artigo, regulamento                         decreto, lei, artigo, norma
(diploma)       (decree, law, article, regulation)                        (decree, law, article, norm)



Nevertheless, when the words sharing fewer than 100 attributes (in fact, the most 
abundant in the corpus) were compared, we observed that the lists obtained from the 
+prep-strategy are semantically more homogeneous than the lists obtained from the 
-prep-strategy. Table 5 shows some of the lists yielded by these strategies for less 
frequently appearing words. 

¹ We do not use a systematic evaluation methodology based on machine-readable dictionaries 
or electronic thesauri, because lexical resources of this sort are not yet available for 
Portuguese. 

Table 5. Similarity lists of less frequently appearing words (< 100 attributes) produced by contexts 
with and without prepositional information 

Word                      Cluster of similar words
                          +prep-strategy                            -prep-strategy

tempo                     data, momento, ano, antiguidade           década, presidente, admissibilidade, problemática
(time)                    (date, moment, year, seniority)           (decade, president, admissibility, problem)

regulamento               estatuto, código, diploma, decreto        membro, decreto, plano, acto
(regulation)              (statute, code, diploma, decree)          (member, decree, plan, act)

organismo                 autarquia, comunidade, órgão, gabinete    coordenação, dgpc, unidade, ipa
(organization)            (county, community, organ, cabinet)       (coordination, dgpc, unit, ipa)

finalidade                objectivo, escopo, fim, objecto           capacidade, campo, financiamento, publicidade
(finality)                (goal, scope, aim, object)                (ability, domain, financing, advertising)

fim                       objectivo, finalidade, resultado, efeito  decurso, resultado, alvará, apresentação
(aim)                     (goal, finality, result, effect)          (duration, result, charter, introduction)

conceito                  noção, regime, estatuto, conteúdo         correspondência, grupo, presidente, temática
(concept)                 (notion, regime, statute, content)        (correspondence, group, president, subject)

área                      âmbito, matéria, processo, sector         meio, vista, macau, construção
(area)                    (range, matter, process, sector)          (mean, view, macau, construction)

These results deserve special comment. Take the lists obtained for the word 
“tempo” (time). In the +prep-strategy, the attribute <↓por, contrato> (<↓by, contract>) 
is shared by “tempo” and “anos” (years). As its global weight is quite high 
(0.78), this attribute makes the two words more similar. By contrast, in the 
-prep-strategy, the non-prepositional attribute <↓, contrato> has a very low weight: 0.04. 
Such a low value makes the attribute insignificant when computing the similarity between 
“tempo” and “anos”. 

Consider another example: the lists obtained for the word “área” (area). In 
the +prep-strategy, the attribute <↓em, actividade> (<↓in, activity>) is shared by 
the words “área”, “âmbito” (range) and “sector” (sector). Given that its global weight 
is very high (0.94), it helps to make these three words semantically close. By 
contrast, in the -prep-strategy, the non-prepositional attribute <↓, actividade> has 
a lower weight: 0.37. Consequently, it cannot be considered a significant clue when 
comparing the three words. 

Therefore, it can be assumed that the information about specific prepositions is 
relevant to characterise and identify the significant syntactic contexts used for the 
measurement of word similarity. 
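
The effect of the global weights quoted above can be made concrete with a small sketch. The similarity measure actually used in the paper is defined in an earlier section and is not reproduced here; the code below substitutes a plain weighted Jaccard coefficient as a stand-in. The two unshared attributes and their weights (0.5) are made up for illustration; only the weights 0.78 and 0.04 come from the text.

```python
from typing import Dict, Set, Tuple

Attribute = Tuple[str, str]


def weighted_similarity(a: Set[Attribute], b: Set[Attribute],
                        weight: Dict[Attribute, float]) -> float:
    """Weighted Jaccard: weight of the shared attributes over weight of all attributes."""
    total = sum(weight[x] for x in a | b)
    if total == 0:
        return 0.0
    return sum(weight[x] for x in a & b) / total


def compare(shared: Attribute, shared_weight: float) -> float:
    # other_1 and other_2 are hypothetical unshared attributes with made-up weights.
    other_1, other_2 = ("x", "other_1"), ("x", "other_2")
    weight = {shared: shared_weight, other_1: 0.5, other_2: 0.5}
    return weighted_similarity({shared, other_1}, {shared, other_2}, weight)


sim_plus = compare(("\u2193por", "contrato"), 0.78)  # +prep-strategy weight from the text
sim_minus = compare(("\u2193", "contrato"), 0.04)    # -prep-strategy weight from the text
```

With identical unshared attributes, the highly weighted prepositional attribute yields a much larger similarity between “tempo” and “anos” than its non-prepositional counterpart, which is the behaviour the examples describe.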

3.3 ↑↓-strategy versus ↑-strategy 

We also tested the contribution of the ↓-attributes (extracted from noun phrases) to yield 
lists of similar words. The lists obtained from the ↑↓-strategy are significantly more 
accurate than those obtained from the ↑-strategy, even for frequently appearing words such 
as “diploma” (diploma) or “decreto” (decree). Table 6 illustrates some of the lists 
extracted from both strategies. 

Table 6. Similarity lists produced by contexts with and without ↓-attributes 

Word                      Cluster of similar words
                          ↑↓-strategy                                         ↑-strategy

juiz                      dirigente, presidente, subinspector, governador     contravenção, vereador, recinto, obrigatoriedade
(judge)                   (leader, president, subinspector, governor)         (contravention, councillor, enclosure, obligatoriness)

diploma                   decreto, lei, artigo, convenção                     tocante, diploma, magistrado, visita
(diploma)                 (decree, law, article, convention)                  (concerning, diploma, magistrate, visit)

decreto                   diploma, lei, artigo, n                             ambos, sessão, secretaria, coadjuvação
(decree)                  (diploma, law, article, n)                          (both, session, department, cooperation)

regulamento               estatuto, código, sistema, decreto                  membro, meio, prejuízo, emissão
(regulation)              (statute, code, system, decree)                     (member, mean, prejudice, emission)

regra                     norma, princípio, regime, legislação                lugar, data, causa, momento
(rule)                    (norm, principle, regime, legislation)              (location, date, cause, moment)

renda                     caução, indemnização, reintegração, multa           fornecimento, instalação, aquisição, construção
(income)                  (guarantee, indemnification, reimbursement, fine)   (supply, installation, acquisition, construction)

conceito                  noção, estatuto, regime, temática                   grau, tipicidade, teatro, abordagem
(concept)                 (notion, statute, regime, subject)                  (degree, typicality, theater, approach)

On the basis of the results illustrated above, it can be assumed that the use of 
↓-attributes to yield lists of similar words is extremely significant. Indeed, this type of 
attribute provides information concerning the semantic word class. Consider 
the ↓-attributes <↓de, capítulo> (<↓of, chapter>), <↓a, anexo> (<↓to, attached>) 
and <↓de, conteúdo> (<↓of, content>), shared by the words “decreto” and “diploma”. 
As those attributes require nouns denoting the same class, namely documents, they can 
be conceived as syntactic patterns imposing the same selectional restrictions on nouns. 
Consequently, the nouns appearing with those specific ↓-attributes should belong to the 
class of documents. 

In the Appendix, we compare the lists extracted by using the fine-grained techniques 
(i.e., both ↑↓-strategy and +prep-strategy) to the lists extracted by using the 
coarse-grained methods: ↑-strategy and -prep-strategy. The words constituting the 
lists obtained from the more informative strategies are semantically more homogeneous 
than those obtained from the less informative ones. 

4 Final Remarks 

According to the results of the tests described above, the strategies based on rich 
syntactic contexts are more accurate for the measurement of word similarity. Indeed, the 
specific syntactic data that we used to refine syntactic attributes allows us to identify 
more informative syntactic-semantic dependencies between words. Experimental tests 
demonstrated that similarity measures relying on the fine-grained syntactic contexts we 
have defined in this paper perform better than those based on poorly defined contexts. 

Nevertheless, all the syntactic-based approaches (fine-grained and coarse-grained 
syntactic strategies) are confronted with two sorts of linguistic phenomena: polysemous 
words and odd attachments of syntactic dependencies between words. 

The lists of words recognised as being similar to a particular word can be semantically 
heterogeneous because of the lexical polysemy of the compared word. For example, the 
word “contrato” (contract) appears to be similar to words describing activities such as 
“trabalho” (work), “acto” (act), “processo” (process), as well as to words describing 
documents such as “norma”, “regulamento”. It shares with the former group attributes 
requiring a sort of activity (e.g., <↓de, execução> (<↓of, execution>), <↓de, 
conclusão> (<↓of, end>)), and shares with the latter group attributes requiring documents: 
<↓dobj, assinar> (<↓dobj, sign>). Various clustering proposals have been made so as 
to group similar words together along sense axes [5,2]. However, the implementation 
of an efficient clustering method for grouping semantically homogeneous 
words is left for future work. 

Finally, attachment errors inherited from the parser should be taken into account. 
The number of these errors increases when the language analysed is not as syntactically 
rigid as English. In a sample of analysed text from our Portuguese corpus, we found that 
more than 30% of the attachments were incorrect. This poor accuracy is probably 
due to the unpredictable constraints on Portuguese word order. To mitigate the 
noisy information inherited from the parser limitations for Portuguese, we are impelled 
to find more robust word similarity strategies than those used for English texts. So, the 
fine-grained strategies defined in this paper could be perceived as an attempt to partially 
compensate for the poor results obtained by parsing Portuguese text corpora. 

Abstract. In this paper, we introduce a web-based integrated text and knowledge 
mining aid system in which information extraction and intelligent information 
retrieval/database access are combined using term-oriented natural language tools. 
Our work is placed within the BioPath research project, whose overall aim is to link 
information extraction to expressed sequence data validation. The aim of the tool 
is to extract terms automatically, to cluster them, and to provide efficient access 
to heterogeneous biological and genomic databases and collections of texts, all 
wrapped into a user-friendly workbench enabling users to use a wide range of 
textual and non-textual resources effortlessly. For the evaluation, automatic term 
recognition and clustering techniques were applied in the domain of molecular 
biology. Besides English, the same workbench has been used for term recognition 
and clustering in Japanese. 



1 Introduction 

The increasing production of electronically available texts (either on the Web or in other 
machine-readable forms such as digital libraries and archives) demands appropriate 
computer tools that can perform information and knowledge retrieval efficiently. The size 
of knowledge in some domains (e.g. molecular biology, computer science) is increasing 
so rapidly that it is impossible for any domain expert to assimilate the new knowledge. 
Vast amounts of knowledge remain unexplored and this poses a major handicap to a 
knowledge-intensive discipline. 

Information retrieval (IR), either via keywords or via URL links, has been used 
intensively to navigate through the WWW in order to locate relevant knowledge resources 
(KSs). While URLs can be specified in advance by domain specialists, like links in 
hypertexts, IR via keywords can locate relevant URLs (and thus KSs) on the fly. URLs 
specified in advance are more effective in locating relevant KSs, but they cannot cope 
with the dynamic and evolving nature of KSs over the WWW. On the other hand, links 
using keywords, as in a typical IR system, can certainly cope with the dynamic nature 
of KSs in the WWW by computing links on the fly, but this technique often lacks the 
effectiveness of direct links via URLs, as users are often forced to make tedious trials 
in order to choose the proper sets of keywords to obtain reasonably restricted sets of 
KSs. This is a well-known problem of WWW querying techniques, and techniques 
that combine the advantages of these two approaches are needed. Furthermore, since 
URLs are often too coarse to locate relevant pieces of information, users have to go 
through several stages of the information seeking process. After identifying the URLs of 
the KSs that possibly contain relevant information, they have to locate the relevant pieces 
of information inside the KSs by using the navigation functions of each KS. This process 
is often compounded by the fact that users’ retrieval requirements can only be met by 
combining pieces of information in separate databases (or document collections). The user 
has to navigate through different systems that provide their own navigation methods, and 
has to integrate the results by herself/himself. An ideal knowledge-mining aid system 
should provide a seamless transition between the separate stages of information seeking 
activities. 

* This research is supported by LION BioScience, http://www.lionbioscience.com 

The ATRACT system, introduced in this paper, aims at this seamless navigation 
for the specific domain of molecular biology. It is ‘term-centered’, as we assume that 
documents are characterized by sets of technical terms which should be used as keywords 
for retrieval. Therefore, the very first problem to address is to recognise terms. 

The paper is organised as follows: in section 2 we briefly overview ATRACT, and 
in section 3 we present the design of the system. In section 4 we present an 
analysis and evaluation of our experiments conducted on corpora in the domain of 
nuclear receptors, together with results obtained on a Japanese corpus. 

2 ATRACT: An Integrated Term-Centered, Text Mining System 

ATRACT (Automatic Term Recognition and Clustering for Terms) is a part of the 
ongoing BioPath¹ project [1]. The goal of the project is to develop software components 
allowing the investigation and evaluation of cell states on the genetic level according 
to the information available in public data sources, i.e. databases and literature. The 
main objective of ATRACT is to support users’ knowledge mining by intelligently guiding 
them through various knowledge resources and by integrating data and text mining, 
information extraction, information categorization and knowledge management. 

As in traditional keyword-based document retrieval systems, we assume that 
documents are characterized by sets of terms which can be used for retrieval. We differentiate 
between index terms and technical terms, and in this paper we are referring to technical 
terms, i.e. the linguistic realisation of specialised concepts. In general, technical terms 
represent the most important concepts of a document and characterize the document 
semantically. We also consider contextual information between a term and its context 
words, since this information is important for the improvement of term extraction, term 
disambiguation and ontology building. 

A typical way of navigating through the knowledge resources on the WWW via 
ATRACT is that a user whose interest is expressed by a set of key terms retrieves a set 
of documents (e.g. from the MEDLINE database [7]). Then, by selecting the terms that 
appear in the document, s/he retrieves fact data from different databases in the WWW. 
Which databases have to be accessed should be determined automatically by the system. 

¹ BioPath is a collaborative EUREKA research project coordinated by LION BioScience and 
ValiGen. 

In order to implement the term-centered navigation described above, we have to deal 
with the following problems: 

- Term recognition. In specialized fields, there is an increasing number of new terms 
that represent newly created concepts. Since existing term dictionaries cannot cover 
the needs of specialists, automatic term extraction tools are needed for efficient term 
discovery. 

- Selection of databases. There is a multitude of databases accessible over the WWW 
dealing with biological and genomic information. For one type of information there exist 
several databases with different naming conventions, organisation and scope. Accessing 
the relevant database for the type of information we are seeking is one of the crucial 
problems in molecular biology. Once the suitable database(s) is found, there remains the 
difficulty of discovering the query items within the database. In addition, naming 
conventions in many domains (especially in molecular biology) are highly ambiguous 
even for fundamental concepts (e.g. ‘tumor’ can correspond to either a disease or the 
mass of tissue; on the other hand, ‘TsaB’ is a protein, and ‘tsaB’ is a gene), which affects 
the selection of appropriate databases. 

ATRACT aims to provide solutions to the problems described above by integrating 
the following components: automatic term recognition, context-based automatic term 
clustering, similarity-based document retrieval, and intelligent database access. 



3 ATRACT System Design 



The ATRACT system contains the following components (see Figure 1): 

(1) Automatic Term Recognition (ATR). The ATR module recognizes terms included 
in HTML/XML documents using the C/NC-value method [2], though any method of 
term recognition can be used. The C/NC-value method recognizes term candidates on the 
fly from texts which often contain unknown or new terms. This method is a hybrid approach, 
combining linguistic knowledge (term formation patterns) and statistics (frequency of 
occurrence, string length, etc.). C/NC-value extracts multi-word terms and performs 
particularly well in recognizing nested terms, i.e. sub-strings of longer terms. One of 
the innovative aspects of NC-value (in addition to the core C-value method) is that it is 
context sensitive. In a specific domain, lexical preferences in the type of context words 
occurring with terms are observed [5], [6]. The incorporation of contextual information 
is based on the assumption that lexical selection is constrained in sublanguages and 
that it is syntactified. The user can experiment with the results of the ATR module by 
tuning parameters such as threshold value, threshold rate, weights, selection of 
part-of-speech categories, choice of linguistic filters, number of words included in the 
context, etc., according to his/her specific interests. 

(2) Automatic Term Clustering (ATC). Contextual clustering is beneficial for resolving 
the terminological opacity and polysemy, common in the field of molecular biology. 
Table 1, for example, shows problems of term ambiguity in the field. The same terms 
which are fairly specific and domain dependent still have several different meanings, 
depending on the actual context in which these terms appear. This means that, depending 
on the context, we have to refer to different databases to retrieve fact data of these terms. 
Fig. 1. System design of ATRACT 



The ATC module classifies terms recognized by the ATR module based on contextual 
clustering and statistical techniques. It is an indispensable component in our 
knowledge-mining system, since it is useful for term disambiguation, knowledge acquisition 
and construction of domain ontology. The approach is based on the observation that terms 
tend to appear in close proximity with terms belonging to the same semantic family [5]. If 
a context word has some contribution towards the determination of a term, there should be 
a significant correspondence between the meaning of that context word and the meaning 
of the term. Based on that observation, we compare the semantic similarities of contexts 
and terms. The clustering technique automatically derives a thesaurus using 
the AMI (Average Mutual Information) hierarchical clustering method [12]. This 
method is a bottom-up clustering technique and is built on the C/NC-value measures. As 
input, we use bigrams of terms and their context words, and the output is a dendrogram 
of hierarchical term clusters. 

(3) Similarity-based Document Retrieval (DR). DR is a VSM (vector space model)- 
type document retrieval module. It retrieves texts associated with the current document, 
allowing the user to retrieve other related documents by assigning selected keywords 
and/or documents using similarity-based document retrieval. The user can also retrieve 
documents by specifying keywords. 
Table 1. Term ambiguity 

term                      protein   enzyme   compound
amino acid                +         -        -
amino acid sequence       +         -        +
pyruvate dehydrogenase    +         +        +
pyruvate carboxylase      +         +        +



(4) Intelligent Database/item Selection (IDS) using database (meta-)ontology. IDS 
selects the most relevant databases and their items using term class information 
assigned by the ATC module and the database’s (meta-)ontology information. All terms are 
‘clickable’ and dynamically ‘linked’ to the relevant databases over the Internet. The relevant 
databases should be dynamically selected according to the terms and the term hierarchy 
information. The module is implemented as an HTTP server, designed to choose the 
appropriate database(s) and to focus on the preferred items in the database(s) according 
to the user’s requirements. The most relevant databases are determined automatically by 
calculating association scores between the term classes and the description of databases 
(such as meta-data). Furthermore, the retrieved content can be modified in order to focus 
on the most pertinent items for the user. It is also possible to show similar entries by 
calculating similarities using the term classes and the domain-specific ontology when an 
exactly matching entry is not found. 
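
To make the statistical part of the ATR module (component 1) more tangible, the following is a minimal sketch of the C-value score as it is commonly stated in the term recognition literature: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is scaled by the logarithm of the candidate's length. The linguistic filter, the thresholds and the NC-value context weighting are omitted, and the frequencies below are made up for illustration; this is not the ATRACT code.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    n = len(short)
    return n < len(long) and any(long[i:i + n] == short
                                 for i in range(len(long) - n + 1))


def c_value(freq):
    """Score multi-word candidates; `freq` maps word tuples to corpus frequencies."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidate terms that contain `a`.
        longer = [f_b for b, f_b in freq.items() if is_nested(a, b)]
        if longer:
            f_a = f_a - sum(longer) / len(longer)
        scores[a] = math.log2(len(a)) * f_a
    return scores


# Hypothetical candidate frequencies, for illustration only.
freq = {
    ("nuclear", "receptor"): 10,
    ("orphan", "nuclear", "receptor"): 4,
    ("nuclear", "receptor", "gene"): 2,
}
scores = c_value(freq)
```

Here “nuclear receptor” is nested in two longer candidates, so its frequency of 10 is reduced by their average frequency (3) before scaling, which is how the measure handles nested terms.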

4 Experiments and Evaluation 

We conducted experiments to confirm the feasibility of our proposed workbench. The 
evaluation was performed on 2,000 MEDLINE abstracts [7] in the domain of nuclear 
receptors (for English), and on a NACSIS Japanese AI-domain corpus [4]. We focused 
on the quality of automatic term recognition and of similarity measure calculation using 
automatically clustered terms, as all other techniques are based on term extraction. 
- Term recognition. We examined the overall performance of the NC-value method in 
terms of precision and recall using the 11-point² score, applying it to the same corpus 
and the same correction set as the C-value. The 
top of the list produced by C-value (the first 20% of extracted candidate terms) was used 
for the extraction of term context words, since these show high precision on real terms. 
We used 30 context words for all the extracted terms in the evaluation, the number having 
been determined empirically. 

Figure 2 (left) shows the 11-point precision-recall score of the NC-value method in 
comparison with the corresponding C-value for English. It can be observed that NC-value 
increases the precision compared to that of C-value at all the corresponding points for 
recall. Similarly, NC-value increases the precision of term recognition compared to pure 
frequency of occurrence. Although there is a small drop in precision compared to C-value 
in some intervals (Figure 2, right), NC-value generally increases the concentration 
of real terms at the top of the list. 

² The 11-point score indicates that, for example, precision at recall 0.10 is taken to be 
the maximum of the precisions at all recall points greater than 0.10. 
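
The 11-point score described in the footnote can be sketched as follows. The code assumes a ranked list of candidate terms marked as real terms or not, and uses the common information-retrieval convention of taking the maximum precision at any recall greater than or equal to each level 0.0, 0.1, ..., 1.0. The function name and the toy ranking are illustrative and not taken from the evaluation.

```python
def eleven_point_precision(ranked_hits, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        hits += hit
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolate: best precision reachable at this recall level or beyond.
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


# Toy ranking: True marks a candidate that is a real term (3 real terms in total).
curve = eleven_point_precision([True, False, True, True, False], total_relevant=3)
```

Plotting `curve` against the eleven recall levels gives a precision-recall curve of the kind shown in the left-hand panels of Figures 2 and 3.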
Fig. 2. 11-point score (left) and interval precision (right) for English 



The C/NC-value method was also applied to a collection of NACSIS AI-domain texts in 
Japanese [4]. As Figure 3 shows, the results are similar to those 
obtained for English. Although the same technique is used, different linguistic filters 
were defined in order to describe term patterns in Japanese [9]. 




Fig. 3. 11-point score (left) and interval precision (right) for Japanese 



- Clustering terms and database handling. We used the similarity measure 
calculation as the central computing mechanism for choosing the most relevant database(s), 
determining the most preferred item(s) in the database(s), and disambiguating term 
polysemy. The term clusters were built using Ushioda’s AMI-based hierarchical 
clustering program [12]. As training data, we used 2,000 MEDLINE abstracts. 
Similarities between terms were calculated according to the hierarchy of the clustered 
terms. In this experiment, we adopted the semantic similarity calculation method 
described in [11] for measuring the similarity between terms. We used three sets (DNA, 
PROTEIN, SOURCE) of manually classified terms and calculated the average similarity 
(AS) of every possible combination of the term sets, that is, 

$$AS(X, Y) = \frac{1}{n} \sum_{x \in X,\; y \in Y,\; x \neq y} sim(x, y),$$

where X and Y indicate each set of the classified terms; sim(x, y) indicates the similarity 
between terms x and y; and n indicates the number of possible combinations of terms 
in X and Y (except the case where x = y). As Table 2 shows, each AS between 
terms of the same class, i.e. AS(X, X), is greater than the others. We believe 
that it is feasible to use automatically clustered terms as the main source of 
knowledge for calculating similarities between terms. However, despite these results on 
clustering and disambiguation, searching for suitable databases on the Web still remains 
a difficult task. One of the main problems encountered is that we are not certain which 
databases have items which best describe our request, i.e. we do not know whether the 
related databases are pertinent to our request. In addition, the required information is 
sometimes distributed over several databases, i.e. almost all databases are disparate in 
terms of the information they contain. 



Table 2. Average similarities 

            DNA      PROTEIN   SOURCE   # of terms
DNA         0.533    -         -        193
PROTEIN     0.254    0.487     -        235
SOURCE      0.265    0.251     0.308    575
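
The average similarity AS(X, Y) reported in Table 2 can be computed along the following lines. The sketch assumes that a pairwise term similarity function is already available; in the paper this is the hierarchy-based measure of [11], for which a toy function with made-up values is substituted here. The term sets are likewise hypothetical.

```python
from itertools import product


def average_similarity(xs, ys, sim):
    """Mean of sim(x, y) over all pairs with x in xs, y in ys and x != y."""
    pairs = [(x, y) for x, y in product(xs, ys) if x != y]
    if not pairs:
        return 0.0
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


# Hypothetical term sets and a toy similarity: 0.5 within a class, 0.25 across classes.
dna = ["dna_term_1", "dna_term_2", "dna_term_3"]
protein = ["protein_term_1", "protein_term_2"]
term_class = {t: "DNA" for t in dna}
term_class.update({t: "PROTEIN" for t in protein})


def toy_sim(x, y):
    return 0.5 if term_class[x] == term_class[y] else 0.25


within = average_similarity(dna, dna, toy_sim)       # analogue of AS(DNA, DNA)
across = average_similarity(dna, protein, toy_sim)   # analogue of AS(DNA, PROTEIN)
```

When X and Y are the same set, each unordered pair is counted twice, which leaves the average unchanged for a symmetric similarity function.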



5 Conclusion and Further Research 

In this paper, we have presented ATRACT, a web-based integrated text and knowledge 
mining workbench. ATRACT extracts terms automatically based on a combination of 
linguistic and statistical knowledge, clusters terms, and provides seamless navigation and 
access to heterogeneous databases and collections of texts. The workbench provides the 
user with a friendly environment for term extraction and clustering from a variety of 
knowledge and textual sources. The system enables a logical integration of databases on 
the Web: the design allows users to refer to the required items gathered from several 
web databases as if they were accessing a single, sophisticated virtual database. 

Important areas of future research will involve the improvement of term recognition 
using semantic/clustered term information and additional syntactic structures (e.g. term 
variants and coordination [3], [10]), and the improvement of database handling. Since our 
goal is to deal with the ‘open world’ of databases, where there is insufficient information 
on what sort of data is contained in each database, selecting the most closely associated 
databases is one of the crucial problems. Therefore we will have to resolve the problems of 
choosing database(s) from the large number of databases on the Web (and of recognising 
newly launched databases as well), and of modifying the ‘view’ of each database according 
to the requirements (since the format styles vary from site to site). We expect that 
meta-data information could be useful for selecting database(s) if enough meta-data about 
each database is available on the Web. Regarding the format style of each database, we 
expect that the popularity of XML might offer a solution to the problem. 

Abstract. Emerging electronic text formats include hierarchical structure and 
visualization-related information that current Text-to-Speech (TtS) systems ignore. 
In this paper we present a novel approach for composing detailed auditory 
representations of e-texts using speech and audio. Furthermore, we provide a scripting 
language (CAD scripts) for defining specific customizations of the operation of a 
TtS. CAD scripts can also be assigned to specific text meta-data to enable their 
discrete auditory representation. This approach can provide a means for a detailed 
exchange of functionality across different TtS implementations. Moreover, it can 
be hosted in current TtS systems with minor (or major) modifications. Finally, we 
briefly present the implementation of DEMOSTHeNES Composer for augmented 
auditory generation of meta-text using the above methodology. 



1 Introduction 

In the recent past, the research focus of Text-to-Speech (TtS) systems, concerning e-text 
handling, was mainly the extraction of linguistic information from rather plain e-texts in 
order to render prosody. In recent years, the format and the visual representation of 
e-texts have changed, introducing new issues in speech generation. Various e-text 
types (like HTML, MS-Word and XML) provide on the one hand text information in 
well-formed hierarchical structures and on the other optional information about the way the 
text will be visualized. Furthermore, in many other cases e-text does not have a continuous 
flow with a proper syntactic structure, as the use of written telegraphic style has increased 
within documents. For example, an HTML document contains titles and legends in 
buttons and photos. Finally, a huge amount of e-text is stored in databases. Queries to 
these databases return different types of text that have well-defined data structures (e.g. 
tables) but lack syntactic structure. 

Thus, the e-text types described above carry a fair amount of text meta-data that is not 
related to semantics or pragmatics. These text meta-data are defined as meta-text. 
Meta-text can therefore be format oriented or visually oriented and, more recently, speech 
oriented. 

Emerging speech mark-up languages (like VoiceXML [1] and STML [2]) facilitate 
the transfer of speech information, by providing means (tags) for defining high-level 
prosodic behavior (or alternatively detailed low-level prosody in terms of pitch and 
time) and audio insertions. 

However, the above speech-related mark-up languages do not support meta-text 
adequately enough to fit a TtS system appropriately. Firstly, the majority of 
existing e-text types are not in one of the above speech-aware formats. Secondly, even 
speech mark-up languages offer no facilities to represent non-speech 
tags. Moreover, there is not yet a standardized method for interpreting speech tags and, 
as a consequence, this depends heavily on the view of each TtS system. Finally, the 
use of low-level prosodic description, though detailed, is not optimal for cross-language 
use of the document (consider the case of the translation of an English text, formatted as a 
VoiceXML document with a detailed low-level pitch curve description, to a Czech one). 

In this paper we present a novel approach for composing detailed auditory 
representations of e-texts using speech and audio. After the formulation of the major 
requirements for auditory representation of meta-text, we describe a general methodology 
for e-text to speech and audio composition. Next, we provide a scripting language (CAD 
scripts) for defining specific customizations of the operation of a TtS system. CAD scripts 
can also be assigned to specific text meta-data to enable their discrete auditory 
representation. Finally, we briefly present the implementation of DEMOSTHeNES Composer 
for augmented auditory generation of meta-text using the above methodology. 



2 Requirements for Auditory Representation of Meta-text 

We performed a short experiment to observe how people handle meta-text with 
their speech when reading aloud. Ten people were asked to read aloud the following 
document, as presented in a browser, to see how they would convey the structures 
involved using their speech. 

<title>Speech Synthesis for Blind Persons</title> 
<h1>Tools that convert text information to speech</h1> 
<p>Text Readers enable users to <b>hear</b> - rather than read - 
texts from a computer.</p> 
<table><tr> 
<td>Institute</td><td>Description</td> 
</tr> 
<tr> 
<td>University of Athens, Department of Informatics</td> 
<td>DEMOSTHeNES Composer <a href= 
"http://www.di.uoa.gr/speech/synthesis/demosthenes">Click 
here.</a></td> 
</tr></table> 



The results differed from person to person, mainly because everyone tried to invent
their own way to describe the table structure. Some readers completely ignored the
structure shown on the screen and read the document without any variation in their
speech. However, this document has a strongly defined hierarchical structure. If we
read it as a single line, we hide all the structural information that it embeds. It
is like showing the document in a browser without its formatting, as a single line of
text, which additionally implies that the document loses some of its meaning.

Visualization can provide further information to support the vocalization of the
text (e.g. bold letters usually imply emphasis). To enhance the audible representation
of e-texts, meta-text must be handled differently from ordinary text. Several works
have looked into ways to enhance the presentation of structures such as tables and
lists in auditory-only interfaces [3]. They focus on the insertion of non-speech
audio signals [4], such as beeps, and on prosodic variations and speaker-style change [5].
The latter work in particular shows how non-lexical cues inserted in a lexical string
can enhance presentations with limited or no visual support.

An efficient audible representation of e-text requires the combination of synthetic 
speech, appropriately formed prosodic events and non-speech sounds. In order for a TtS 
system to accommodate these requirements, it should provide: 

1. A mechanism for identifying meta-text in a document. 

2. The appropriate data types and services for storing and manipulating the meta-text. 

3. An efficient prosody and audio generator closely coupled with the meta-text. 

Though the first requirement can be hosted in current TtS synthesizers in the form of an
add-in module, the second and the third require major restructuring, as current
synthesizers make no provision for meta-text.

To describe the variety of information carried by meta-text that comes into play
during speech generation, as well as the insertion of non-speech sounds, we introduce
the notion of composing rather than synthesizing when referring to such systems. Our
approach essentially suggests a sub-system that enables the conversion of e-text to
speech and audio. We call it the e-TSA composer, or simply “composer” for the purposes
of this paper. It applies to almost any open and modular TtS system with minor or major
modifications (e.g. [6] and [7]). The e-TSA composer cooperates with the underlying
TtS system (from now on referred to simply as “TtS”).

3 E-text to Speech & Audio Composer 

Figure 1 presents the architecture and the flow of information in the e-TSA composer we 
propose. It consists of five major components (the “E-text Adapter”, the “Transformer”, 
the “Clusters Auditory Definition”, the “Modules properties” and the “Composer”) and 
a set of modules (for example the “HTML” and “Word” e-text adaptation modules) 
that customize its functionality. Black shapes (“Composer” and the text, speech and 
audio related modules) indicate strong coupling with the underlying TtS. The reasons 
for choosing this scheme accompanied by XML were mainly two: (a) the ability to 
hierarchically represent any type of meta-text and (b) the flexibility in writing scripts to 
customize the TtS functionality and vocalize e-texts. 

The e-TSA composer is based on the Extensible Markup Language (XML) and its
transformation capabilities. XML is an emerging document type that offers a text-based
format for storing and transferring data.

The three main stages (namely “E-text to cXML”, “cXML to ciXML” and “ciXML 
to S&A”) will be described in the next paragraphs. 

In [8] and [9], XML has been used as a means for the internal representation and
processing of hierarchical linguistic structures. However, those works deal neither with
non-linguistic structures nor with the interplay of speech and audio signals. In
the present work we additionally use XML to organize and index meta-text under a
hierarchical representation so that it can be processed further.

Fig. 1. The architecture of the e-TSA composer.



3.1 E-text to cXML 

The first step involves adaptation of the source e-text to a hierarchical format, called
composer-XML (cXML), that represents the structures of the source in a way that is
meaningful for the composer. Thus, we can have a complete view of the elements that
constitute the e-text (e.g. the tag <table>). We call these elements clusters. A cluster
in the cXML format has a discrete XPath. This approach enables the user to decide how
each cluster should be vocalized. Details are given in Section 3.2.

The “E-text Adapter” component offers services for importing, storing and inter- 
linking text and meta-text from a number of e-text sources. Because of the variety of e- 
text types, several modules are provided to deal with each specific type (see Figure 1). For 
example, when dealing with MS-Word documents, the Word’s instruction for “bolding” 
text can be represented in cXML with a <bold> tag that starts in an appropriate position 
and ends in its corresponding one. An appropriate module called “Word” implements 
this manipulation. If the document is already in any XML-like format (e.g. VoiceXML), 
it still needs some adaptation to map the source tags to user defined ones, according to 
the configuration that follows. Suitable modules handle this case as well. 
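
The notion of a cluster with a discrete XPath can be illustrated with a minimal Python sketch. It is not part of the composer; it assumes that the sample document of Section 2 has already been adapted to a well-formed tree, wraps it in a hypothetical <doc> root, and uses a simplified path notation (tag names joined by slashes). The helper name cluster_paths is ours.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample document from Section 2, wrapped in a
# hypothetical <doc> root so that it parses as one tree (href omitted).
SAMPLE = """<doc>
<title>Speech Synthesis for Blind Persons</title>
<h1>Tools that convert text information to speech</h1>
<p>Text Readers enable users to <b>hear</b> - rather than read - texts.</p>
<table>
<tr><td>Institute</td><td>Description</td></tr>
<tr><td>University of Athens, Department of Informatics</td>
<td>DEMOSTHeNES Composer <a>Click here.</a></td></tr>
</table>
</doc>"""


def cluster_paths(elem, prefix=""):
    """Yield (path, element) for every element below elem, depth first."""
    for child in elem:
        path = f"{prefix}/{child.tag}"
        yield path, child
        yield from cluster_paths(child, path)


root = ET.fromstring(SAMPLE)
paths = [path for path, _ in cluster_paths(root)]
distinct = sorted(set(paths))
for p in distinct:
    # Each distinct path is one cluster type; the count is its instances.
    print(p, paths.count(p))
```

Every distinct path is a candidate for its own auditory definition, while the number of instances shows how often that definition would be applied in this document.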

3.2 cXML to ciXML 

The second step in the composer is the production of a complete instruction set for the
TtS. The aim here is a scheme that can be applied recursively to any instance
of a cluster in order to generate the appropriate speech and audio. For example, one
might want to apply a “staccato” pattern when reading the <title> in HTML pages, or
a “flat” pitch when reading the <headline>s in an e-newspaper.

To facilitate this we need a mechanism that associates a customized speech and
audio behavior with each cluster. This constitutes the Cluster Auditory Definition (CAD).
A CAD consists of a collection of parameters for the modules of the TtS, an
XPath description of the corresponding cluster and an ID of the instance (or the type)
of the e-text it refers to. By having access to the functionality of the modules, we can
apply different configurations to the specified cluster and optionally include non-speech
sounds as well.

Fig. 2. A Cluster Auditory Definition (CAD) refers to dedicated meta-text with a customized
auditory behavior.

To make the implementation of a CAD possible, each module in the underlying TtS
should provide a Document Type Definition (DTD) in which it defines a unique namespace
and the parameters that customize it [10]. For example, assume a module that handles
rhythm in a TtS and provides options for applying several tempo patterns. It will define
a VMOD_RHYTHM namespace and its DTD should contain:

<!ELEMENT Tempo EMPTY>

<!ATTLIST Tempo speed (lento | andante | allegro)>

<!ATTLIST Tempo accentual_variation (legato | staccato)>

CAD Scripts. A script in a CAD must be able to make use of any service and data
available in the TtS. It acts as a link between the cluster and a detailed programmed
behavior of the TtS when dealing with this cluster.

We demonstrate this with an example: take the HTML document of Section 2
and consider what you would expect from a TtS. We can define seven different CADs,
referred to here by their XPath. The auditory representation described reflects
personal preferences:

1. /title: apply a “staccato” rhythm pattern upon vocalization,

2. /h1: form a single accent group during prosody generation,

3. /p: apply natural prosody,

4. /table: insert specific start/end beeps to indicate the table,

5. /table/tr: insert the word “row”, pronounced in low pitch,

6. /table/tr/td: insert the word “column” in high pitch and form a comma-ending
phrase,

7. /table/tr/td/a: spell it loudly.

To create a script, we use transformation rules in XSLT [11] template format. The
template for CAD number 7 above (/table/tr/td/a) would be
as follows:

<template match="table">

<VMOD_SOUND:Insert file="hyperlink.wav"/>

<VMOD_ENERGY:Loud>

<VMOD_RHYTHM:Tempo accentual_variation="staccato">
<apply-template select="//*/a"/>

</VMOD_RHYTHM:Tempo></VMOD_ENERGY:Loud>
</template>

After applying the above template, the produced file embeds instructions for
customizing the functionality of the TtS. The result is formatted in the
composer-interface XML type (ciXML) and looks as follows:

<VMOD_SOUND:Insert file="hyperlink.wav"></VMOD_SOUND:Insert>
<VMOD_ENERGY:Loud>

<VMOD_RHYTHM:Tempo accentual_variation="staccato">

Click here.

</VMOD_RHYTHM:Tempo></VMOD_ENERGY:Loud>

This template implies that whenever a link (/a) cluster occurs in the source
HTML file under a /table cluster, it will be vocalized using the behavior described in
the template. In this way we implement the recursive feature described at the beginning
of Section 3.2.

A template can host more complicated instructions concerning the available
underlying modules. For example, we can define and parameterize the intonation model
to be followed, how to handle the different types of sentences present in a CAD, etc.
In effect, this approach constitutes a full programming interface for the composer
that can be shared among different TtS architectures that adopt the e-TSA approach.
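
The recursive application of a CAD to every instance of a cluster can be sketched in a few lines of Python. This is only an illustration: plain Python stands in for the XSLT processor, the table CADS and the function to_cixml are hypothetical names, and the single CAD shown is the one for /table/tr/td/a discussed above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical CAD table: cluster path -> (instructions emitted before the
# content, wrapper elements around the content, outermost first).
CADS = {
    "/table/tr/td/a": (
        ['<VMOD_SOUND:Insert file="hyperlink.wav"/>'],
        [("VMOD_ENERGY:Loud", ""),
         ("VMOD_RHYTHM:Tempo", ' accentual_variation="staccato"')],
    ),
}


def to_cixml(elem, path):
    """Render the content of elem, wrapping every cluster that has a CAD."""
    parts = [escape(elem.text or "")]
    for child in elem:
        parts.append(to_cixml(child, f"{path}/{child.tag}"))
        parts.append(escape(child.tail or ""))
    body = "".join(parts)
    if path not in CADS:
        return body
    before, wrappers = CADS[path]
    for name, attrs in reversed(wrappers):   # build from the innermost wrapper
        body = f"<{name}{attrs}>{body}</{name}>"
    return "".join(before) + body


table = ET.fromstring(
    "<table><tr><td>DEMOSTHeNES Composer <a>Click here.</a></td></tr></table>"
)
result = to_cixml(table, "/table")
print(result)
```

Because the function calls itself for every child, a CAD is applied wherever its cluster path occurs, however deeply the cluster is nested. Clusters without a CAD pass their text through unchanged.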

3.3 ciXML to Speech and Audio 

The ciXML produced above needs to be checked further to ensure that it is well formed.
This check confirms the validity of the referenced namespaces and takes place
at run time. The interpretation of ciXML is left to the underlying TtS, although ciXML
offers a detailed description of the generation process. The TtS should, however, provide
the means to parse it and use its own modules to generate speech and audio. So, the
basic modifications that are needed for a TtS to host the composer are:

1. To create DTDs for each module 

2. To create an interface to parse the ciXML document 

Another flexible feature of the composer’s architecture is that a TtS can be config- 
ured to ignore any instructions which refer to modules it does not have, or to map the 
namespaces to local ones. 
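
How a TtS might ignore instructions for modules it lacks, or map foreign namespaces to local ones, is sketched below in Python. The sketch rests on several assumptions: the directory REGISTERED, the mapping LOCAL_NAMES and the namespace VMOD_VOLUME are invented for illustration, and a regular expression replaces a real XML parser for brevity.

```python
import re

# Hypothetical directory of registered modules: namespace -> known elements.
REGISTERED = {
    "VMOD_RHYTHM": {"Tempo"},
    "VMOD_ENERGY": {"Loud"},
}
# Hypothetical mapping of foreign namespaces to local ones.
LOCAL_NAMES = {"VMOD_VOLUME": "VMOD_ENERGY"}

# Matches opening, closing and empty tags of the form <NAMESPACE:Element ...>.
TAG = re.compile(r"<(/?)(\w+):(\w+)([^>]*)>")


def filter_cixml(cixml):
    """Drop instructions of unknown modules and rename mapped namespaces."""
    def repl(match):
        slash, ns, name, rest = match.groups()
        ns = LOCAL_NAMES.get(ns, ns)
        if name in REGISTERED.get(ns, ()):
            return f"<{slash}{ns}:{name}{rest}>"
        return ""            # unknown module or element: ignore the instruction
    return TAG.sub(repl, cixml)


doc = ('<VMOD_SOUND:Insert file="hyperlink.wav"/>'
       '<VMOD_VOLUME:Loud>Click here.</VMOD_VOLUME:Loud>')
filtered = filter_cixml(doc)
print(filtered)
```

In this example the sound insertion is dropped because no VMOD_SOUND module is registered, while the VMOD_VOLUME instruction is kept under its local name. The text to be spoken survives in both cases.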



4 DEMOSTHeNES: A Complete Composer Implementation 

Based on the requirements set in Section 2 and the approach described above, we
implemented DEMOSTHeNES Composer to accommodate classical TtS applications and,
furthermore, to facilitate the auditory representation of meta-text using the
e-TSA composer architecture. We briefly present some aspects of its implementation
concerning the hosting of the composer [12].

Fig. 3. DEMOSTHeNES architecture.



Its architecture (Figure 3) mainly benefits from its Component Based Technology
(CBT) design. It comprises a set of Vocal COMponents (VCOM), which provide services
for basic text, speech and audio related tasks (e.g. manipulation of segmental
duration, manipulation of linguistic information, etc.), and a Vocal SERVER (VSERVER).
Dynamically linked Vocal MODules (VMOD) customize and expand the functionality
of the kernel and the provided services. VMODs are connected to the VSERVER by
a bi-directional port (BIPORT) that allows them to push and pull information from any
VCOM.

In order to host the composer, we added a registration phase in which each VMOD
passes an appropriately formed DTD describing itself to the VSERVER. The VSERVER
uses a Directory Service to register VMODs and to locate any functionality
described in a ciXML document. We built a VCOM that translates ciXML documents
into the internal structures of DEMOSTHeNES and configured it to ignore any
unknown instruction in the ciXML. Finally, the “e-TSA” VMOD is dynamically linked
to the kernel and implements the composer. For the purposes of this work we also
implemented a module in the composer that translates HTML texts to cXML using the
“E-Text Adapter” services. The composer, which uses the MBROLA synthesizer [13],
is currently implemented on the MS-Win32 platform and can offer vocal services to other
applications.

5 Conclusions 

In this paper we have introduced an approach for handling meta-text information.
Meta-text is stored in an appropriate hierarchical structure and combined, by means of
XML-based scripts, with customized functionality (described in XML) of the underlying
TtS system. The produced document forms a set of instructions that drive the TtS to
generate the auditory representation of the meta-text. Finally, we showed how we exploit
the above methodology to build a complete e-text to speech and audio composer.

Abstract. A syntactic lexicon of verbs with subcategorization information is
crucial for NLP. Two phases of creating such a lexicon are presented. The first phase
consists of the automatic preprocessing of source data, in which particular valency
frames are proposed. Where possible, the functors are assigned; otherwise a set of
possible functors is proposed. In the second phase the proposed valency frames
are manually refined.



1 Introduction 

In this paper* we introduce a semi-automatically prepared syntactic lexicon of Czech
verbs that is enriched with information about functors (members of valency frames)
on the tectogrammatical (underlying) level of language description (Section 2). Such a
lexicon is crucial for any applied task requiring automatic processing of natural language.
We focus on verbs because of their central role in the sentence: the information about the
modifiers of a particular verb enables us to create the ‘skeleton’ of the analyzed sentence.
It can also be used, for example, in connection with WordNet for semantic grouping of
verbs.

As the source data we use a dictionary of verb frames (originally created at Masaryk
University), which is automatically preprocessed (Section 3). In the first phase we
process only a small set of verbs and their frames. This testing set serves to estimate
the extent of the manual changes required in the automatically preprocessed valency
frames (Section 4). More extensive sets will follow. We expect that a substantially
richer lexicon will be available in several months. In the last section (Section 5) the
(preliminary) results are presented.

2 The Concept of Valency Frames of Verbs 

Valency theory is a substantial part of the Functional Generative Description of Czech
(FGD, [Sgall et al., 1986]) and has been studied intensively since the seventies. Originally
it was established for verbs and their frames (see esp. [Panevova, 1974-1975, 1980,
2001]), and was later extended to other parts of speech (nouns and adjectives).

* This work has been supported by the Ministry of Education, project LN00A063, and GACR
405/96/K214.

The concept of valency primarily pertains to the level of underlying representation
(linguistic meaning) of a sentence, and thus it is one of the most important theoretical
notions. On the other hand, as valency information also plays a crucial role in NLP,
the morphemic representation of the particular members of a valency frame is important.

A verbal valency frame (in a strict sense) is formed by so-called valency modifiers,
that is, the inner participants, either obligatory or optional, together with the obligatory
free modifiers. Each Czech verb has at least one valency frame, but it can have more
frames. Slots for valency modifiers together with possible morphemic forms of inner
participants are stored in a lexicon.

On the level of underlying representation, we distinguish five actants (inner
participants) and a great number of free modifiers. The combination of actants is
characteristic of a particular verb. Each actant can appear only once in a valency frame
(if coordination and apposition are not taken into account). The actants distinguished
here are Actor (or Actor/Bearer, Act), Patient (Pat), Addressee (Addr), Origin (Orig) and
Effect (Eff). In contrast, free modifiers (e.g. local, temporal, manner, causal) can modify
any verb and can be repeated with the same verb (the constraints are semantically based).
Most of them are optional and only belong to a ‘valency frame’ in a broader sense.

The inner participants can be either obligatory (i.e. necessarily present at the level
of underlying representation) or optional. Some of the obligatory participants may be
omitted in the surface (morphemic) realization of a sentence if they can be understood
as general. Similarly, there exist omissible obligatory free modifiers (e.g. direction for
‘prijit’ (to come)). Panevova ([Panevova, 1974-1975]) formulated a dialogue test as a
criterion for the obligatoriness of actants and free modifiers.

FGD has adopted the concept of shifting of ‘cognitive roles’ in the language patterning
([Panevova, 1974-1975]). Syntactic criteria are used for the identification of Actor
and Patient (following the approach of [Tesniere, 1959]): Actor is the first actant, and
the second is always the Patient. Other inner participants are detected with respect to
their semantic roles ([Fillmore, 1968], for Czech [Danes, Hlavsa, 1981]).

For a particular verb, its inner participants have a (usually unique) morphemic form,
which must be stored in a lexicon. Free modifiers typically have morphemic forms
connected with the semantics of the modifier. For example, the prepositional group Prep
‘na’ (on) + Accusative case typically expresses Direction, and Prep ‘v’ (in) + Local case
usually has a local meaning (Where).

In addition to the classical, theoretically based valency, we also introduce
quasi-valency, which may be paraphrased as a ‘commonly used modification’ of a particular
item. The concept of quasi-valency enables us to enlarge the information stored in the
lexicon and to capture modifications that do not belong to the valency frame in a strict
sense ([Stranakova, submitted]). There are free modifiers which are not obligatory (and
hence do not belong to the standard valency frame) though they often modify a particular
verb. Three sources of such modifiers can be distinguished: (i) ‘usual’ modifiers without
a strictly specified form (like Direction for ‘jit’ (to go), or Local modifier for ‘bydlet’
(to stay)), (ii) modifiers with a determined morphemic form (often Regard, e.g. ‘zvyhodnit
v necem/na necem’ (to make (st) advantageous for st), or Aim, e.g. ‘potrebovat / poskytovat
na neco’ (to need / provide (st) for st)), and (iii) theoretically unclear cases with ‘wider’
and ‘narrower’ specification (e.g. cause in ‘zemrit na tuberkulozu kvuli nedostatku leku’
(to die of tuberculosis because of the lack of medicine)).

Idiomatic or frozen collocations (where the dependent word is limited either to one
lexical unit or to a small set of such units, e.g. ‘mit na mysli’ (to have in mind))
represent a specific phenomenon. We have set aside the very complex task of processing
them at this stage.

The concept of omissible valency modifiers is reopened with respect to the task of
the lexicon. The omissibility of a modifier is not marked in particular lexical entries:
we presuppose that in the surface (morphemic) realization of the sentence any member of
a valency frame is deletable (at least in specific contexts, e.g. in a question-answer
pair).

Analogously, the fact that a particular actant can be realized as a general participant
is not marked in the valency frame of a verb.



Table 1. Verbal modifiers stored in the lexicon.

                     obligatory                       optional
inner participants   including general participants   +
free modifiers       including omissible modifiers    “commonly used”


3 Data Preprocessing 

As the source data we use a dictionary of verb frames created at Masaryk University 
([Pala and Sevecek, 1997], [Horak, 1998]). The lexicon contains valency frames of circa 
15,000 Czech verbs. The structure is described in [Horak, 1998]. 

3.1 Algorithm for Automatic Assignment of Functors

Identifying and merging frames. In the source lexicon, every lemma is listed only
once, even if it has several valency frames. A single valency frame, on the other hand,
can have several variants (e.g. ‘ucit koho co(acc)’, ‘ucit koho cemu(dat)’ (to teach sb
st)). The variants of one frame are mixed with other frames, and thus the first task is to
separate the different frames and merge the variants. Let us show this on an example. The
verb ‘branit’ (to protect/prevent) has the following format in the source lexicon:

branit <v>hTc3, sI, hPc3-sUeN, hPc3-hTc6r{v}, hPTc4,
hPTc4-hPTc3r{proti}, hPTc4-hPTc7r{pred}

Single frames are separated by commas and members inside a single frame are
separated by dashes. The attribute ‘h’ describes ‘semantic’ features (P-person, T-thing),
the attribute ‘c’ stands for morphemic case, ‘r’ gives the preposition (in
curly braces), ‘sI’ means infinitive and ‘sUeN’ is a negative clause with the conjunction
‘aby’ (that).
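
The entry format just described can be read with a small parser. The following Python sketch is not the tool used for the lexicon. It handles only the attributes explained above, the names MEMBER and parse_entry are ours, and the entry for ‘branit’ is typed in with the infinitive marker spelled sI and the last preposition spelled pred, which is an assumption about the intended spelling.

```python
import re

# One frame member: either a clause/infinitive marker ('s' + code) or
# semantic features ('h'), morphemic case ('c') and an optional preposition
# ('r' with the preposition in curly braces).
MEMBER = re.compile(
    r"^(?:s(?P<clause>\w+)"
    r"|h(?P<sem>[PT]+)c(?P<case>\d)(?:r\{(?P<prep>[^}]*)\})?)$"
)


def parse_entry(line):
    """Split a source-lexicon line into a lemma and a list of frames."""
    lemma, rest = line.split("<v>", 1)
    frames = []
    for frame in rest.split(","):            # frames are separated by commas
        members = []
        for member in frame.split("-"):      # members are separated by dashes
            m = MEMBER.match(member.strip())
            if m is None:
                raise ValueError(f"unrecognized member: {member!r}")
            members.append(
                {k: v for k, v in m.groupdict().items() if v is not None}
            )
        frames.append(members)
    return lemma.strip(), frames


entry = ("branit <v>hTc3, sI, hPc3-sUeN, hPc3-hTc6r{v}, hPTc4, "
         "hPTc4-hPTc3r{proti}, hPTc4-hPTc7r{pred}")
lemma, frames = parse_entry(entry)
for frame in frames:
    print(lemma, frame)
```

Each frame becomes a list of dictionaries, so the later steps can compare members by their attributes instead of by raw strings.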

Now, we can arrange the members of all its frames into a table and try to find
maximal non-intersecting parts.

hTc3
sI
hPc3    sUeN
hPc3    hTc6r{v}
hPTc4
hPTc4   hPTc3r{proti}
hPTc4   hPTc7r{pred}



In the table above we can identify 4 parts. The members that never occur together in
one frame can be declared, with high probability, to be variants of one member. Frames
with single members (like the first and second frame in the example) can be understood
as separate frames, as in the case of ‘mirit kam’ (to head somewhere), ‘mirit na koho’ (to
aim at sb), or as variants of one frame, as in the case of ‘badat nad cim’, ‘badat o cem’
(to research into st). We decided to ‘merge as much as possible’, because this makes the
assignment of the functors easier. The result is shown below:

branit <v> [hTc3 | sI]
branit <v> [hPc3] - [sUeN | hTc6r{v}]
branit <v> [hPTc4] - [hPTc3r{proti} | hPTc7r{pred}]
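
One possible reading of the search for non-intersecting parts is sketched below in Python: frames that share a member are put into the same part, and within a part, two members that never occur together in a frame are reported as candidate variants. This is our interpretation for illustration only, not the authors' program. The function names are hypothetical, and the decision to merge the single-member frames into one frame is not modelled.

```python
from itertools import combinations


def parts(frames):
    """Group frames so that frames sharing a member fall into the same part."""
    groups = []                              # each group: (members, frames)
    for frame in frames:
        members, grouped, rest = set(frame), [frame], []
        for g_members, g_frames in groups:
            if g_members & members:          # shares a member: absorb the group
                members |= g_members
                grouped = g_frames + grouped
            else:
                rest.append((g_members, g_frames))
        groups = rest + [(members, grouped)]
    return [g_frames for _, g_frames in groups]


def variant_pairs(part):
    """Pairs of members of one part that never occur together in a frame."""
    members = sorted({m for frame in part for m in frame})
    return [(a, b) for a, b in combinations(members, 2)
            if not any(a in frame and b in frame for frame in part)]


branit = [
    ["hTc3"], ["sI"],
    ["hPc3", "sUeN"], ["hPc3", "hTc6r{v}"],
    ["hPTc4"], ["hPTc4", "hPTc3r{proti}"], ["hPTc4", "hPTc7r{pred}"],
]
result = parts(branit)
for part in result:
    print(part, variant_pairs(part))
```

For ‘branit’ this yields four parts, and the variant pairs found in the last two parts correspond to the bracketed alternatives in the merged frames.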



Assigning functors. First, we have to add missing subjects to all frames. Then we 
assign functors to all members of a frame. Unfortunately, there is no straightforward 
correspondence between the deep frame and its surface realization, but we can try to 
find some regularities or tendencies, and then formulate rules for assigning the functors 
to the surface frames. Among all correspondences between the two levels, there are some 
which are considered as typical. In the direction from the tectogrammatical level to the 
morphemic one these are: 

Actor: Nominative,

Patient: Accusative,

Addressee: (animate) Dative,

Effect: Prep ‘na’ (to) + Accusative, or Prep ‘v’ (into) + Accusative,

Origin: Prep ‘z’ (from) + Genitive, or Prep ‘od’ (from) + Genitive.

In the opposite direction the correspondences are not so clear because of free
modifications, which have a very broad repertory of surface realizations.

For the successful assignment of actants it is necessary to identify free modifiers.
This identification is already done while merging the frames: there is a list of
possible functors for every surface realization, and this list is attached to every member
of the original frame. When we merge two members of a frame, we also intersect
the attached lists. An empty intersection prevents the two members from
being merged. As a result of the merging phase, we therefore also get a set of possible
functors for every member of a frame. In the optimal case, every member has only
one functor assigned.
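
The intersection step can be made concrete with a short Python sketch. The lists of possible functors in POSSIBLE are invented for illustration, since the actual lists are not given here, and merge_members is a hypothetical helper.

```python
# Hypothetical lists of possible functors for some surface realizations.
POSSIBLE = {
    "hPc3": {"ADDR", "BEN"},
    "hTc3": {"ADDR", "PAT"},
    "hTc6r{v}": {"LOC", "PAT"},
    "hTc6r{na}": {"LOC"},
}


def merge_members(a, b):
    """Return the functors shared by two members, or None if they cannot merge."""
    common = POSSIBLE[a] & POSSIBLE[b]
    return common or None    # an empty intersection prevents the merge


print(merge_members("hTc6r{v}", "hTc6r{na}"))   # one shared functor remains
print(merge_members("hPc3", "hTc6r{na}"))       # nothing shared: no merge
```

The first call leaves a single functor, which is the optimal case mentioned above. The second call returns None, so the two members would be kept apart.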

H. Skoumalova, M. Stranakova-Lopatkova, and Z. Zabokrtsky

After identifying free modifiers we can use an algorithm proposed by Panevova and
Skoumalova ([Panevova and Skoumalova, 1992]) for the actants. This algorithm is based
on the observation that verb frames fall into two categories. The first category contains
frames with at most two actants. The functors are assigned on the basis of the ‘rule of
shifting’ (see Section 2): if there is only one actant in the frame it must be an Actor,
and if there are two, one of them is an Actor and the other a Patient. As we had to add
subjects automatically, we also assumed that they all represent Actor, and
thus all frames in this category are already resolved.

The other category contains frames with at least three actants, which can be sorted into
two subcategories: prototypical and non-prototypical. The prototypical frames contain
only typical surface realizations, and the rule about typical realization can be reversed: if
the surface frame contains only typical surface forms we can assign the corresponding
functors to them. The non-prototypical frames contain at least one untypical surface
realization and a different approach must be adopted. The algorithm is described in
[Skoumalova, submitted].
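
The two categories can be summarized in a simplified Python sketch. It is not the algorithm of [Panevova and Skoumalova, 1992]: the surface labels such as "Nom" or "na+Acc" are our own shorthand, the subject is assumed to come first in the list, animacy of the Dative is ignored, and non-prototypical frames are simply left unresolved.

```python
# Typical correspondences, read from the surface form to the functor.
TYPICAL = {
    "Nom": "ACT",
    "Acc": "PAT",
    "Dat": "ADDR",
    "na+Acc": "EFF",
    "v+Acc": "EFF",
    "z+Gen": "ORIG",
    "od+Gen": "ORIG",
}


def assign_functors(actants):
    """Assign functors to the surface forms of actants (subject first).

    Returns None when the frame is not prototypical and needs another approach.
    """
    if len(actants) == 1:
        return ["ACT"]
    if len(actants) == 2:
        return ["ACT", "PAT"]                # rule of shifting
    if all(form in TYPICAL for form in actants):
        functors = [TYPICAL[form] for form in actants]
        # Each actant may appear only once in a valency frame.
        if len(set(functors)) == len(functors):
            return functors
    return None


print(assign_functors(["Nom", "Dat"]))
print(assign_functors(["Nom", "Acc", "Dat"]))
print(assign_functors(["Nom", "Acc", "o+Loc"]))
```

Note that in the two-actant case the second actant becomes Patient even when its surface form is a Dative, which is exactly what the rule of shifting prescribes.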

After the merging phase, we get three sorts of frames: the first category contains frames
where every member has only one functor assigned; the second category contains frames with
identified actants but ambiguous free modifiers; and the third category contains frames
where at least one member is ambiguous between an actant and a free modifier.
Approximately one third of all merged frames (circa 6500) falls into the first category
(‘final’ frames in the sequel) and another thousand into the second category. These frames
are candidates for further processing with the help of the above-mentioned algorithm, and
therefore they will be separated from the rest (circa 11,000), which must be left for
manual post-editing (the frames belonging to the second and third category are referred
to as ‘ambiguous’). The editor’s work should be easier, as s/he gets a (small) set of
possible functors which can be assigned to every member of a frame and does not have to
choose from all 47 possibilities.

3.2 Testing Set 

For the purpose of testing we made a small set containing the 178 most frequent Czech verbs
with their frames. We omitted the verb ‘byt’ (to be), as it needs special treatment, and
several modal verbs. The set contained circa 350 frames that were created automatically
from the source lexicon. They fall into all three categories mentioned above, that is,
1) fully resolved frames, 2) frames with ambiguous free modifiers, and 3) frames
with ambiguities between actants and free modifiers.

4 Manual Annotation 

The data resulting from the preprocessing step are not perfect: they contain incorrectly
or ambiguously assigned functors, the proposed valency frames may contain mutually
exclusive (alternating) modifiers, some frames are incorrectly merged into a single one,
etc.

That is why we developed a ‘tailored’ editor for the manual processing of the
automatically preprocessed valency frames described above. The
editor was implemented as a relational database in the Microsoft Access environment.

After gaining some experience with annotating the lexicon, we exported the
data from the relational database into the XML data format (Extensible Markup Language,
[Kosek, 2000]). Presently, the XML data are annotated directly in a text editor.

The following attributes are captured for each frame slot:

- functor;

- surface: morphemic realization (mostly the morphemic case of a noun or a prepositional
group), or a list of possible realizations of the particular modifier; the value can be
omitted if no special surface realization is required for the given slot (e.g. directional
circumstantials);

- type: this attribute differentiates between obligatory, optional, and quasi-valency
modifiers;

- alternative: modifiers which are mutually exclusive are marked.



4.1 Examples 

The following examples illustrate the automatically assigned functors and the manual 
refinement of valency frames. 

The verb ‘existovat’ (to exist) has only one valency frame, which belongs to the first
category (fully resolved frames):

existovat R-1[hPTc1]E[hTc2r{u}|hTc6r{na}|hTc6r{v}]$
translated as Actor (Nom) Loc (u+2/na+6/v+6)
manually added mark for arbitrary morphemic realization of local modifier.

The verb ‘pusobit’ (to act/operate/work) has been automatically assigned three
valency frames, two of them (1st, 3rd) marked as ‘ambiguous’ and one (2nd) as ‘final’:

pusobit1 (to operate on st with st) R-1[hPTc1]2CI[hTc7]2A[hPTc4r{na}]& ‘ambig.’
translated as Actor (Nom) amb. (na+Acc) amb. (Ins)
manually changed to Actor (Nom) Patient (na+Acc) Means (Ins),
where Actor and Patient are obligatory and Means is a quasi-valency modifier;

pusobit2 (to do st to sb) R-1[hPTc1]2[hTc4]3[hPc3]& ‘final’
translated as Actor (Nom) Patient (Acc) Addr (Dat)
manually, the alternative surface forms for Patient are added:
clauses attached with the conjunctions ‘ze’ (that) or ‘aby’ (so that);

pusobit3 (to work as sb) R-1[hPTc1]2P[sU]2JR[hTc4r{jako}]& ‘ambig.’
translated as Actor (Nom) amb. (aby) amb. (jako+Acc)
manually changed to Actor (Nom) Patient (jako+Nom) / Loc
where the modifier attached with the conjunction ‘aby’ belongs to
the second frame (as an alternative representation of Patient);
here the Patient alternates with the Local modifier.
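
The four attributes captured for each frame slot can be modelled as a small record. The Python sketch below is only one possible representation and does not reproduce the XML format of the lexicon. The class name FrameSlot, the surface labels and the use of a group number for mutually exclusive slots are assumptions made for illustration. The example encodes the first frame of ‘pusobit’ after manual refinement.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SLOT_TYPES = ("obligatory", "optional", "quasi-valency")


@dataclass
class FrameSlot:
    functor: str
    # Possible morphemic realizations; empty if no special form is required.
    surface: List[str] = field(default_factory=list)
    type: str = "obligatory"
    # Slots carrying the same group number exclude each other (assumption).
    alternative: Optional[int] = None

    def __post_init__(self):
        if self.type not in SLOT_TYPES:
            raise ValueError(f"unknown slot type: {self.type}")


# First frame of 'pusobit' (to operate on st with st) after manual refinement.
pusobit1 = [
    FrameSlot("ACT", ["Nom"]),
    FrameSlot("PAT", ["na+Acc"]),
    FrameSlot("MEANS", ["Ins"], type="quasi-valency"),
]
for slot in pusobit1:
    print(slot.functor, slot.surface, slot.type)
```

Validating the type on construction keeps the three-way distinction between obligatory, optional and quasi-valency modifiers explicit in the data.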



5 Evaluation of Results, Conclusions 

At this stage of the work only a small testing set of verbs and their frames has been
treated. This set serves to clarify the method of manual processing (‘what’ we want to
capture and ‘how’). Once this small lexicon has been perfected, it will be used for the
further development and testing of automatic procedures. But even on this set of available
data some preliminary results can be stated.

It is clear now that even the frames marked as ‘final’ after the preprocessing must be
checked and manually refined: only about 35 percent of the ‘final’ valency frames were
perfect, i.e. 13 percent of all frames proposed. Fortunately, there was a relatively large
number of frames which differ only ‘slightly’ from the desired result: approximately
16 percent of valency frames were correctly merged, but the functors were assigned
incorrectly (often with ‘verba dicendi’), and in a further 20 percent one functor is either
missing in the frame or superfluous. About 27 percent of frames were deleted (circa one
half as incorrect, one half as frames already detected with another morphemic realization).
Then the missing frames were added manually and several cycles of corrections followed. We
carried out cross-checking: we extracted and separately compared sets of frames containing
a certain functor, we compared frames of verbs with similar meaning, etc.

Basic statistical characteristics are as follows:

- number of processed verbs: 178

- number of frames: 462 (on average 2.6 frames per verb)

- number of all frame slots: 1481 (on average 3.2 slots per frame)

- distribution of the number of frame slots per frame (Table 2)

- distribution of frame slots according to their type (Table 3)

- number of occurrences of individual functors in the lexicon (Table 4).



Table 2. Distribution of the number of frame slots per frame. 

number of slots            1     2     3     4     5     6     7     8 
number of frames          16   145   134    95    45    15    10     1 
% (out of all frames)    3.5  31.4  29.0  20.6   9.7   3.2   2.2   0.2 



Table 3. Distribution of the frame slots according to the type. 

type                   obligatory   optional   quasi-valency 
occurrences                   918        200             363 
% (out of all slots)         62.0       13.5            24.5 



Table 4. Number of occurrences of the 18 most frequent functors. 

order  functor                  occurrences    order  functor                  occurrences 
    1  ACT (actor)                      460       10  ORIG (origin)                     40 
    2  PAT (patient)                    362       11  DIR1 (direction to)               25 
    3  ADDR (addressee)                  93       12  BEN (benefactive)                 23 
    4  EFF (effect)                      86       13  AIM (aim)                         21 
    5  MANN (manner)                     71       14  ACMP (accompaniment)              18 
    6  REG (regard)                      67       15  TWHEN (time-when)                 15 
    7  LOC (location)                    49       16  DIR2 (dir. which way)             14 
    8  DIR3 (direction from)             49       17  EXT (extent)                      13 
    9  MEANS (means)                     48       18  INTT (intention)                   7 

Roughly one half of the processed verbs is contained in the Czech part of the EuroWordNet 
lexical database [Pala, Sevecek, 1997]. Currently we try to map the valency frames 
to EuroWordNet synsets. 

We expect that the large amount of time consumed by the preparation of such a small 
lexicon is due to the fact that we have processed the most frequent Czech verbs, 
which are likely among the most difficult ones. Extending the processed data may, 
we hope, lead to an increased effectiveness of the algorithm presented. 

Abstract. A graph-based information retrieval framework for the visualization 
of clusters of related legal documents is proposed. Clusters are obtained through 
the analysis of the citations and topics of the retrieved documents. 

Users may interactively refine their queries by selecting graph nodes. The 
context of the user's interaction is then calculated and shown in a tree structure. 
Users can return to previous interaction states by selecting nodes of this 
tree. 

This framework was applied to the decisions of the Portuguese Attorney General 
and made available on the web (http://www.pgr.pt - in Portuguese [3]). 
A detailed example of a user interaction with the system is shown in the paper. 
An evaluation procedure applied to the system showed a decrease in 
the average number of interactions per user session. 



1 Introduction 

Legal information retrieval systems are growing in complexity, namely in the size and 
number of their text bases. Moreover, there is a special need for IR systems that are able 
to calculate and show the relations between different legal texts. In fact, legal texts 
can be related in different ways: by citations, topics, and author. 

In this paper we present a web legal information retrieval system that is able to 
cooperatively calculate and show relations between the retrieved set of documents. These 
relations are shown using graphs and the users are allowed to refine their queries by 
selecting the graph nodes. 

Cooperation is also achieved through the visualization of the interaction context in 
a tree-structure. Users are allowed to return to previous interaction states by selecting 
some of these tree nodes. 

This framework was applied to the decisions of the Portuguese Attorney General and 
made available on the web (http://www.pgr.pt - in Portuguese [3]). The framework 
was implemented in a logic programming environment that has previously been used 
with success to model rational agents and their interactions [6]. 

Section 2 briefly describes the core information retrieval module. Section 3 describes 
the clustering procedures that were used and section 4 describes the construction of the 
context structure. In section 5 a detailed example is presented and in section 6 some 
evaluation results are presented. Finally, in section 7 conclusions and future work are 
pointed out. 



V. Matousek et al. (Eds.): TSD 2001, LNAI 2166, pp. 150-157, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 

2 Information Retrieval Module 

The legal information retrieval system is based on SINO [2] from the AustLII Institute. 
SINO was adapted to the Portuguese language. Namely, the 
new system uses a Portuguese lexicon (more than 900,000 words) in order to handle 
morphological errors and to obtain the base form of the queried word. For instance, if the user asks 
for documents in which a specific word appears, the system also searches 
for documents containing derived words (plurals for nouns, verbal forms for verbs, ...). 

As a top layer over the basic IR system we use a thesaurus of juridical terms. This 
thesaurus is the result of another project: PGR - Selective access of documents from 
the Portuguese Attorney General [7,4]. 

The juridical terms thesaurus can be described as a taxonomy which has the relations: 

- is equivalent to 

ex: law is equivalent to norm 

- is generalized by 

ex: prime minister is generalized by minister 

- is specified by 

ex: accident is specified by traffic accident 

- is related with 

ex: desertion is related with traffic accident 

The thesaurus is used to expand queries to include all the values that are equivalent to, 
more specific than, or related to the initial query (more information can be found in [3]). 
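
The expansion step can be illustrated with a minimal sketch. The class `Thesaurus`, its method names, and the transitive handling of "is specified by" are assumptions made for this illustration only; they are not the data structures of the PGR thesaurus. The sample entries are the examples given in the list above.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus holding three of the relations listed above (hypothetical structure)."""

    def __init__(self):
        self.equivalent = defaultdict(set)    # term -> equivalent terms (symmetric)
        self.specified_by = defaultdict(set)  # term -> directly more specific terms
        self.related = defaultdict(set)       # term -> related terms (symmetric)

    def add_equivalent(self, a, b):
        self.equivalent[a].add(b)
        self.equivalent[b].add(a)

    def add_specific(self, general, specific):
        # "general is specified by specific"; read in the other direction,
        # "specific is generalized by general".
        self.specified_by[general].add(specific)

    def add_related(self, a, b):
        self.related[a].add(b)
        self.related[b].add(a)

    def expand(self, term):
        """Return the term with its equivalent, more specific and related terms."""
        expanded = {term}
        expanded |= self.equivalent.get(term, set())
        # Assumption of this sketch: "more specific" is followed transitively.
        stack = list(self.specified_by.get(term, set()))
        while stack:
            t = stack.pop()
            if t not in expanded:
                expanded.add(t)
                stack.extend(self.specified_by.get(t, set()))
        expanded |= self.related.get(term, set())
        return expanded


thesaurus = Thesaurus()
thesaurus.add_equivalent("law", "norm")
thesaurus.add_specific("minister", "prime minister")
thesaurus.add_specific("accident", "traffic accident")
thesaurus.add_related("desertion", "traffic accident")

print(sorted(thesaurus.expand("accident")))
```

A query for "accident" is thus also run for "traffic accident". Generalizations are deliberately not added, since the text expands only towards equivalent, more specific, and related terms.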

The result is a powerful IR system, which has many similarities with the work 
described in [1], namely, allowing the extraction of textual information using localization, 
inference, and controlled vocabulary. 

3 Document Clustering 

In our framework we decided to cluster documents based on two different characteristics: 
citations and subjects. All documents were previously analyzed and a database relating 
each document with its citations and subjects was built. Then, for each user query, a 
set of documents is obtained (using SINO) and these documents are clustered using the 
relations previously calculated. Finally, the obtained clusters are visualized as a list or 
as a graph structure (using the Graphviz package developed by AT&T and Lucent Bell 
Labs). For a complete description of the cluster process, namely its methodology and 
algorithms, see [5]. 

3.1 Citations 

The complete set of legal documents from the Portuguese Attorney General was pro- 
cessed to obtain the citations between documents. 

In order to obtain all the citations it was necessary to construct a specialized chart 
parser, which is able to partially analyze the text and to retrieve and normalize the 
citations between documents. Note that documents can be cited by their number, date, 
title, author, and by almost any mixture of these fields. Taking this fact into account, 
we have built a database with these fields for each document and, whenever a possible 
citation appears in the text, the parser checks whether the cited document exists and, if so, builds 
a new citation entry in the database. 

These citations are used by the system to build the graph of relations between the 
set of documents retrieved by the user queries. 
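
As a rough sketch of this step, the citation arcs can be restricted to the retrieved set and written out in the Graphviz DOT language. The function names, the document identifiers, and the in-memory dictionary standing in for the citation database are all hypothetical.

```python
def citations_within(retrieved, citations):
    """Return the (citing, cited) arcs whose two ends are both in the retrieved set."""
    retrieved = set(retrieved)
    return sorted(
        (doc, cited)
        for doc in retrieved
        for cited in citations.get(doc, ())
        if cited in retrieved
    )


def to_dot(retrieved, arcs):
    """Serialize the documents and citation arcs as a Graphviz DOT digraph."""
    lines = ["digraph citations {"]
    lines += [f'  "{doc}";' for doc in sorted(set(retrieved))]
    lines += [f'  "{a}" -> "{b}";' for a, b in arcs]
    lines.append("}")
    return "\n".join(lines)


def most_cited(arcs):
    """Return the document cited by the largest number of retrieved documents."""
    counts = {}
    for _, cited in arcs:
        counts[cited] = counts.get(cited, 0) + 1
    return max(counts, key=counts.get) if counts else None


# Hypothetical citation database: document id -> ids of the documents it cites.
citations = {"P2": {"P1"}, "P3": {"P1", "P9"}, "P4": {"P1", "P2"}}
retrieved = ["P1", "P2", "P3", "P4"]

arcs = citations_within(retrieved, citations)
print(to_dot(retrieved, arcs))
print("most cited:", most_cited(arcs))
```

The citation of P9 is dropped because P9 is not part of the result set. A document that collects most of the incoming arcs, here P1, stands out in the drawing.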

As an example, the query “bombeiro” (fireman) obtains the following graph of 
citations: 




Fig. 1. Citations: Bombeiro - Fireman 



Note that it was possible to detect an important legal document that is referred to by 
most of the other retrieved texts. It was, probably, a document that created jurisprudence 
about some fireman cases. The user is able to directly search that document by “clicking” 
on the document node. 

3.2 Topics 

In order to obtain the clusters of topic relations it was necessary to classify each document 
according to a set of concepts previously defined by the Portuguese Attorney 
General Office (PAG Office). The classification was done manually by the PAG Office 
and automatically by a neural network [5]. The documents were parsed in order to build 
the topics relationship database. Using this database it is possible to visualize the graph 
of topic relations and/or a list of clustered concepts. 

Graph of topic relations. The graph of relations is calculated using the database of topic 
relations that was previously built, both manually and automatically. Each pair of documents 
with at least 90% of common topics is related by a graph arc. 
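
The arc criterion can be sketched as follows. The text does not say relative to which set the 90% is measured; the sketch assumes the share of common topics relative to the smaller of the two topic sets. The function names and the sample documents are hypothetical.

```python
from itertools import combinations


def topic_overlap(a, b):
    """Share of common topics relative to the smaller topic set (an assumption)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def topic_arcs(doc_topics, threshold=0.9):
    """Relate every pair of documents whose topic overlap reaches the threshold."""
    return [
        (d1, d2)
        for d1, d2 in combinations(sorted(doc_topics), 2)
        if topic_overlap(doc_topics[d1], doc_topics[d2]) >= threshold
    ]


def clusters(doc_topics, arcs):
    """Connected components of the topic graph; unrelated documents stay alone."""
    parent = {d: d for d in doc_topics}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for a, b in arcs:
        parent[find(a)] = find(b)
    groups = {}
    for d in doc_topics:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical topic assignments for three retrieved documents.
doc_topics = {
    "D1": set("abcdefghij"),
    "D2": set("abcdefghiz"),
    "D3": {"x", "y"},
}
arcs = topic_arcs(doc_topics)
print(arcs)
print(clusters(doc_topics, arcs))
```

D1 and D2 share nine of their ten topics and are joined by an arc, while D3 remains an isolated node, which mirrors the picture of one tight cluster plus unrelated documents.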

As an example, the query “bombeiro” obtains the following graph of topic relations: 




Fig. 2. Topics: Bombeiro - Fireman 

As in the previous section, it was possible to detect a cluster of closely related documents 
and a set of unrelated documents (probably about some minor distinct subjects). 
As can be seen, using this approach users are cooperatively helped in their searches. 

4 Context Structure 

The user interaction context is kept in a tree structure. This structure records both user 
and system questions and answers. The structure is used to compute the meaning of 
a user query and to allow the user to return to a previous point of the dialogue and to 
build a new branch from there. 

As an example of the use of the interaction structure suppose that, after the query 
“bombeiro”, the user wants to refine the query by the concept “ferido” (hurt). Figure 3 
shows this interaction structure (in Portuguese). 



The visualization of the tree-context structure allows the user to easily select non- 
explored branches of the tree and to refine his previous queries. 

In the example, if the user queries the system with the expression “militar” (military), 
the system may be able to detect that, as there are no documents about “fireman”, “hurt”, 
and “military”, the intended user query may be the refinement of a previous query: 
“fireman”, and “military”. Figure 4 shows the tree-structure after this inference. 
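
One possible reading of this inference is sketched below: the new term is attached under the deepest node on the path to the root for which the conjunction of terms still retrieves documents. The class `ContextNode`, the function `refine`, and the toy index are hypothetical. The actual system is implemented in a logic programming environment and may infer the context differently.

```python
class ContextNode:
    """One query refinement in the interaction tree."""

    def __init__(self, term, parent=None):
        self.term = term
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def terms(self):
        """Conjunction of terms from the root down to this node."""
        node, out = self, []
        while node is not None:
            out.append(node.term)
            node = node.parent
        return list(reversed(out))


def refine(current, new_term, search):
    """Attach new_term under the nearest ancestor whose refined query is non-empty.

    `search` is any callable mapping a list of terms to a list of documents.
    Returns the new node, or None if no node on the path yields documents.
    """
    node = current
    while node is not None:
        if search(node.terms() + [new_term]):
            return ContextNode(new_term, parent=node)
        node = node.parent
    return None


# Toy index standing in for the retrieval module.
DOCS = {1: {"bombeiro", "ferido"}, 2: {"bombeiro", "militar"}, 3: {"bombeiro"}}


def search(terms):
    return [d for d, words in DOCS.items() if set(terms) <= words]


root = ContextNode("bombeiro")
hurt = refine(root, "ferido", search)       # bombeiro and ferido
military = refine(hurt, "militar", search)  # no joint hits, so it refines "bombeiro"
print(military.terms())
```

In this toy run the node for “militar” ends up as a sibling of the node for “ferido”, both under “bombeiro”.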

In the next section a more detailed example is presented. 





bombeiro and ferido 



Fig. 3. Query: Bombeiro - Fireman; Refine by “ferido - hurt” 





bombeiro and militar 
bombeiro and ferido 



Fig. 4. Query: Bombeiro - Fireman; Refine by “militar - military” 

5 Interaction Example 

Suppose the user wants to find documents about military personnel who have been injured and 
received a pension from the government. He wants to find out if there have been similar 
cases in the legal knowledge base of the Portuguese Attorney General. 

In order to help readers and to keep the example short, the Portuguese web version 
is not presented. Instead, a compact English version of the interaction is shown. The 
complete web interaction can be easily reproduced from the URL http://www.pgr.pt. 

[User - Q1:] Militares (militaries) 

[System - Q2:] Result: 1127 documents 

Documents with keyword: 

(120) incapacity 
(86) medal 
(28) court 
(16) commission 
(11) rights 
(4) marriage 



Options on this list were obtained by clustering the set of 1127 documents. 
[User - Q3:] Chooses: incapacity 
[System - Q4:] Result: 120 documents 

Documents with keyword: 

(10) service 
(6) pension 



Options in this column are obtained by re-clustering the 120 documents. 

[User - Q5:] Chooses: pension 
[System - Q6:] Result: 6 documents 

Options in Figure 5 were obtained by graph clustering of the 6 documents by topic. 
[User - Q7:] Refinement: illness (after analyzing the topics of the retrieved documents 
and deciding that they were not relevant for his goals) 

[System - Q8:] There are no documents of the 6 above that match your new constraint. 
I assume that your query refines Q3, i.e. you are looking for documents about: militaries 
and incapacity and illness. 







Fig. 5. Graph clustering by topic 

Fig. 6. Graph clustering by topic 



[System - Q9:] Result: 6 documents 

Options in Figure 6 were obtained by graph clustering of the 6 documents by topic. 

This example shows some of the flexibility and cooperativeness of the system, al- 
lowing users to dynamically refine their queries and helping them in a pro-active way, 
giving hints and clustering the retrieved documents. 

During the interaction, the tree representation of the dialogue is being inferred and 
displayed in a visual tree-diagram. As an example, the tree representation of the previous 
example is presented in Figure 7 (in a compact version). 



[Q1-Q2] 



[Q3-Q4] 

[Q5-Q6][Q7-Q8-Q9] 



Fig. 7. Interaction Structure Tree 

6 Evaluation 

As is widely known, the evaluation of legal systems with knowledge representation 
and reasoning capabilities is a very complex and difficult task [8]. 

The main goal of this evaluation process was not to evaluate the information retrieval 
module but the impact of the graph visualization of clusters in the interaction with the 
users. 

We have defined two sets of users: one using the clustering module and the other 
accessing the information retrieval module directly. We recorded their queries 
during the second semester of the year 2000, and the preliminary results show that our 
system helps the users by decreasing the average number of queries needed to 
obtain the desired documents (by around 20%). 



7 Conclusions 

We have presented a legal information retrieval system that uses graphs to visualize 
clusters of related documents. The visualization of these clusters may help the users 
in their searches, showing the relations between the documents and clustering them by 
topic. 

Moreover, the interaction context is inferred and displayed in a tree-structure. This 
visualization allows the users to analyze the history of the interaction and to explore 
other branches of the structure. 

The preliminary evaluation results showed that our cooperative system is able to 
help the users by decreasing the average number of queries needed to obtain the desired 
documents (by around 20%). 

As future work, we intend to obtain more evaluation results (quantitative and qualitative) 
of the cooperative graphical clustering module. We would also like to apply our 
system to other information retrieval systems, namely to non-Portuguese legal documents. 

Abstract. The problem of automatic text segmentation is subcategorized into 
two different problems: thematic segmentation into rather large topically self-contained 
sections, and splitting into paragraphs, i.e., lexico-grammatical segmentation 
of a lower level. In this paper we consider the latter problem. We propose 
a method of reasonably splitting text into paragraphs based on a text cohesion measure. 
Specifically, we propose a method of quantitative evaluation of text cohesion 
based on a large linguistic resource - a collocation network. At each step, our 
algorithm compares word occurrences in a text against a large DB of collocations 
and semantic links between words in the given natural language. The procedure 
consists in evaluating the cohesion function, smoothing and normalizing it, and 
comparing it with a specially constructed threshold. 



1 Introduction 

In the recent decade, automatic text segmentation has become a popular research area 
[4-13,15,17,19]. In most cases, thematic segmentation is considered, i.e., the borders to be 
found subdivide the text into rather long thematically self-contained parts. In contrast 
to most works in the area, in this paper we propose a method for low-level, 
lexico-grammatical segmentation. The difference between these two segmentation tasks can be 
explained as follows. 

A good application of thematic segmentation is automatic extraction of thematically 
relevant part(s) from a long unstructured file. When a file is too long for the user to read it 
through completely, a computer tool - a segmentation program - is quite handy. Another 
application of such segmentation is advising a novice author on how to better split his/her 
large and not yet polished sci-tech text into balanced and thematically diverse sections. 

As the main tool for thematic segmentation, the sets of terms belonging to each 
potential segment are considered. For example, the words most frequently used in the 
whole text are selected, the stop-words (mainly functional) are discarded, and then the 
similarity between adjacent potential segments is measured across the potential border 
as the cosine coefficient of the occurrence numbers of the remaining content words. 
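
For illustration only, the cosine coefficient between two adjacent potential segments can be computed as in the sketch below. The stop-word list and the sample segments are made up, and the preliminary selection of the most frequent words of the whole text is omitted for brevity.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def term_counts(words):
    """Occurrence numbers of the content words of one potential segment."""
    return Counter(w.lower() for w in words if w.lower() not in STOP_WORDS)


def cosine(c1, c2):
    """Cosine coefficient between two vectors of occurrence numbers."""
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


left = term_counts("the court ruled on the pension claim".split())
right = term_counts("the pension claim was rejected".split())
other = term_counts("firemen fought blaze".split())

print(round(cosine(left, right), 3))  # shared vocabulary: border less likely here
print(cosine(left, other))            # no shared vocabulary: candidate border
```

A low value across a potential border suggests a change of topic at that point.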

In such a task, the segmentation of the lower level, i.e., the division of text into 
sentences and paragraphs is supposed to have been done. Thus, paragraphs are considered 
as minimal text units with already determined lengths (measured in words or sentences) 
and terminological content. 

However, this is in itself a problem faced by every author, namely, the problem of 
optimally splitting the text into paragraphs. One might consider rational splitting of text 
into paragraphs a component of general school education in writing correct texts. 
However, numerous manuscripts of master-level students show that this component of 
school education is not effective: many people who have to write sci-tech texts do not 
do it well. Specifically, grammatically correct texts written by humans are frequently 
subdivided into paragraphs in a rather arbitrary manner, impeding smooth reading. 

According to [22], the rational low-level structuring of sci-tech texts is rather difficult 
even for humans. Besides splitting text into paragraphs, it includes other difficult tasks, 
e.g., the introduction of numbered or bulleted items. In this paper we confine ourselves only 
to the task of splitting text into paragraphs. 

It is a commonplace that singling out a paragraph conforms to some grammatical 
and logical rules that seem to be so far not formalized and thus not computable. From 
this point of view, the work [21] is an important step to this objective, but it supposes the 
problem of how to represent automatically by logical terms the meaning of any sentence 
and a text as a whole to have been solved, whereas modern computational linguistics 
only aims at this goal. 

In this paper, we propose a method of lexico-grammatical segmentation of lower 
level, i.e., splitting texts into paragraphs. It is based on the following conjectures: 

• Splitting text into paragraphs is determined by current text cohesion. Cohesive links 
are clustered within paragraphs, whereas the links between them are significantly 
weaker. 

• At present, text cohesion has no formal definition. A human considers a text cohesive 
if it consistently narrates about selected entities (persons, things, relations, actions, 
processes, properties, etc.). At the level of semantic representation of text, cohesion 
is ‘observable’ in the form of linked terms and predicates of logical types, but it is 
not well explored how one can observe the same links ‘at the surface.’ 

• In such conditions, it is worthwhile to suppose that text cohesion can be approxi- 
mately determined through syntactic, pseudosyntactic, and semantic links between 
words in a text. 

By pseudo-syntactic links we mean links that are similar to syntactic ones but hold 
between words of different sentences, for example, the link between chief and 
demanded in the text She insulted her chief. He demanded an apology.¹ 

• Syntactic links are considered as in dependency grammars [14], which arrange 
the words of any sentence in dependency trees. In the example (she hurriedly) went 
through (the big) forest, the words outside the parentheses constitute a dependency 
subtree (in this case, a chain) with the highlighted content words at the ends and the 
functional (auxiliary) word in between. The words within parentheses, as well as all 
other possible words of the sentence, are linked into the same tree, and other pairs 
of linked content words can be observed among them, such as hurriedly went 
or big forest. 



¹ Formally, we define such a link to hold between words a and b (not necessarily in the same 
sentence) if in the text there is a word c coreferent with a and syntactically linked to b. 

Syntactic links between two content words are called collocations², whereas functional 
words only subcategorize them. Indeed, collocations can be of various classes: 
the first example above represents a combination of the ruling verb and its (prepositional) 
complement, while the two other examples give combinations of a verb or noun 
with their modifiers. 

• Semantic links are well known. They connect synonyms, a hyponym with the corresponding 
hyperonym, the whole with its part, or a word with its semantic derivative, 
like possessor to possessive or to possess. When occurring in the same text, 
such words are rarely linked syntactically. Their co-occurrences have another reason. 
Namely, the anaphoric (coreferential) entities can be represented in a text not only 
by direct repetitions and pronouns, but also by their synonyms or hyperonyms. 

• A quantitative measure of cohesion implied by (pseudo-)syntactic and semantic links 
can be proposed. This measure fluctuates along the text, with maximums 
in the middle of sentences and minimums between them. Some local 
minimums are deeper than others. It is precisely these that should be taken as splitting borders. 

This paper proposes a method of quantitative evaluation of text cohesion. It compares 
word occurrences in a text against a large DB of collocations and semantic links in a given 
natural language. (Pseudo-)syntactic links are more important, since within segments 
comparable in length with paragraphs no statistics of relevant terms can be collected. 
Taking into account the co-occurrences, our method processes the cohesion function stage 
by stage, i.e., it recurrently evaluates this function, smooths it, normalizes it, and compares 
it with a specially constructed threshold. 



2 Databases of Collocations and Semantic Relations 

An example of a huge DB containing semantic relations between English words is 
WordNet [3]. The EuroWordNet system [20] presents the same semantic relations for 
several other European languages. Regrettably, there are no collocations in these databases, 
in our definition of this term. Though semantic relations can be found in these 
sources, they alone do not solve the problem of evaluating cohesion. 

The only large DB of collocations and semantic links we know of is the CrossLexica 
system [1,2]. Unfortunately, it covers only the Russian language. However, we consider a 
system of this type as a base for our algorithm, in the hope that large resources of this 
type will soon be available for other languages such as English or Spanish. 

Let us discuss now the notion of pseudo-syntactic links in more detail, since such 
links are very important for our purposes. 

Syntactic links hold within a sentence. Hence, if a pair of content words co-occurring 
in the same sentence is found in the collocation DB as potentially forming a link, the 
probability of this syntactic link between them in the given sentence is very high. Even 
if the link between the words in the text is different from the link registered in the DB for 
these words (e.g., the text contains the woman who went... while the DB contains woman 
← go), the observed co-occurrence almost always gives evidence for some cohesion. In 
essence, we take into account anaphoric links in such cases. 

² There are different definitions of a collocation, e.g., [4], [11]. Some of them are based on statistical 
properties of word occurrences, e.g., mutual information. However, we define a collocation in 
the way explained here and use this term in this meaning throughout the paper. 

Similarly, anaphoric links, which we cannot detect directly, permit us to suppose 
cohesion between the lexemes chief and demand when they are registered in the DB as 
immediately linked but occur in adjacent (but different) sentences: She insulted 
her chief. He demanded an apology. 

As to the semantic links, they hold across sentence borders even more frequently 
than the anaphorically conditioned pseudo-syntactic links mentioned above. 

All these considerations give us grounds to ignore full stops in a text for detecting 
cohesive pairs in adjacent sentences. 

3 Quantitative Evaluation of Text Cohesion 

In this section we present the algorithm of calculation of the text cohesion. 

The algorithm uses a discrete variable i - the number of the word in the text. Punctu- 
ation marks have no numbers, and full stops ending sentences are not taken into account 
at this stage. 

At each step, for the given position in the text the algorithm calculates the value of 
a special function that is to be compared with a threshold; the current position is then 
advanced. As soon as the value of the function crosses the threshold, a new paragraph is 
started, and the internal variables are reset. The details of the calculation of the function 
are explained below. 

Let $i$ be the current observation point within a text. To the left, some syntactically 
interconnected word pairs $\{p_k, q_k\}$, $p_k < q_k \le i$, have occurred. We define a partial 
measure of cohesion implied by the $k$-th such pair as the function $U(q_k - p_k)$, where 
$q_k - p_k$ is the distance between the words in the pair. Naturally, $U$ decreases with the growth 
of $q_k - p_k$. We may suppose also that $U$ depends on the class $T_k$ of the syntactic relation 
within the pair and on the specific lexemes $X(k)$ occurring at the points $k$. However, in 
a rough approximation we ignore the dependence on the class and the lexemes. 

It is natural to suppose that the impact of the pair $\{p_k, q_k\}$ decreases at the point 
$i$ as it moves away from $q_k$. We evaluate this by the exponential factor 
$\exp(-\alpha(i - q_k))$. For the accumulated impact on the text cohesion of all (pseudo-)syntactically 
related words to the left of $i$ (including pairs with the latter word in $i$), we 
have the following value: 

$$\sum_{q_k \le i} U(q_k - p_k)\,\exp(-\alpha(i - q_k)) \tag{1}$$

For each semantically related pair $\{m_k, n_k\}$, $m_k < n_k \le i$, the partial measure of 
cohesion is taken as $V(n_k - m_k)$, and the dependence of $V$ on the distance $n_k - m_k$ 
is generally different from that for $U$. Again, let us ignore the dependence of $V$ on the 
semantic relation class $S_k$ of the $k$-th pair and on the specific semantically linked lexemes. With 
the same exponent reflecting the ‘forgetting’ process, the measure of the total prehistory 
for semantically related pairs is: 

$$\sum_{n_k \le i} V(n_k - m_k)\,\exp(-\alpha(i - n_k)) \tag{2}$$




162 



I.A. Bolshakov and A. Gelbukh 



By (1) and (2), the global cohesion function $F(i)$ satisfies the equation 

$$F(i) = \exp(-\alpha)\,F(i - 1) + Q(i) \tag{3}$$

where 

$$Q(i) = \sum_{q_k = i} U(q_k - p_k) + \sum_{n_k = i} V(n_k - m_k) \tag{4}$$

The functions $U$ and $V$ were also taken in the exponential form: 

$$U(q_k - p_k) = A \exp(-\beta(q_k - p_k)); \quad V(n_k - m_k) = B \exp(-\delta(n_k - m_k))$$

where $A$ and $B$ are constants whose ratio is to be selected experimentally; $\beta = (1 \ldots 3)/L$; 
$\delta = (0.5 \ldots 1)/L$; $L$ is the mean length of a sentence. We can evaluate 
equation (3) recurrently, since its current value is composed of the previous value taken 
with a coefficient less than 1 and the contribution $Q(i)$ of all pairs whose latter points 
coincide with the current observation point. Strictly speaking, the impact of the pairs 
extends backward to the very beginning of the text, but in practice only the pairs no farther 
than approximately $1/\alpha$ from the observation point have a noticeable influence. This distance can be considered 
the ‘window width’ of the computing algorithm. 
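
The recurrent evaluation can be sketched as follows. The function name, the argument layout, and the sample link positions are assumptions of this illustration; the links themselves would come from the collocation DB lookup, which is not modeled here. Each pair contributes at the position of its latter word, and the accumulated value decays by exp(-alpha) per word.

```python
import math


def cohesion(n_words, syn_pairs, sem_pairs, alpha, beta, delta, A=1.0, B=1.0):
    """Evaluate the cohesion function F(i) for i = 1..n_words by the recurrence.

    syn_pairs -- (p, q) word positions, p < q, of (pseudo-)syntactically linked words
    sem_pairs -- (m, n) word positions, m < n, of semantically linked words
    Positions are 1-based word numbers; punctuation is not counted.
    """
    # Q[i] collects the contributions of the pairs whose latter word is word i.
    Q = [0.0] * (n_words + 1)
    for p, q in syn_pairs:
        Q[q] += A * math.exp(-beta * (q - p))
    for m, n in sem_pairs:
        Q[n] += B * math.exp(-delta * (n - m))

    # F(i) = exp(-alpha) * F(i - 1) + Q(i), with F(0) = 0.
    F = [0.0] * (n_words + 1)
    decay = math.exp(-alpha)
    for i in range(1, n_words + 1):
        F[i] = decay * F[i - 1] + Q[i]
    return F[1:]


# Hypothetical links in a six-word text; L is the mean sentence length in words.
L = 10
F = cohesion(
    n_words=6,
    syn_pairs=[(1, 2), (2, 5)],
    sem_pairs=[(1, 4)],
    alpha=5 / L,
    beta=2 / L,
    delta=0.75 / L,
)
print([round(f, 3) for f in F])
```

Only one multiplication and one addition per word are needed, which is why the whole prehistory never has to be revisited.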

4 Smoothing and Normalizing the Cohesion Function 

The cohesion function $F(i)$ obtained above has two sources of randomness. First, it is 
heavily saw-toothed, i.e., it contains many local minimums and maximums, which is 
caused by the random scattering of content words in sentences. Before searching for relevant 
minimums in this curve, it is necessary to smooth it. The simplest smoothing is linear 
[16], when the output (smoothed) function $G(i)$ is obtained from the values of an input function 
$F(i)$ (to be smoothed) by the formula 

$$G(i) = \sum_{j=0}^{\infty} R(j)\,F(i - j)$$

where $R(j)$ is the reaction of the smoothing filter to a single value equal to 1. To conserve 
the scaling of the output function, we subject $R(j)$ to the normalizing condition: 

$$\sum_{j=0}^{\infty} R(j) = 1$$

The most convenient options for $R(j)$ are: 

• Exponent $R(j) = (1 - q)\,q^{j}$, where $j = 0, 1, \ldots$; $0 < q < 1$. This gives a recurrent 
formula: 

$$G(i) = q\,G(i - 1) + (1 - q)\,F(i)$$

To effectively determine the result by few recent values of the input function, $q$ should 
be in the interval 0.5 ... 0.7. 




Text Segmentation into Paragraphs Based on Local Text Cohesion 



163 



• Symmetric peak taking three adjacent values of the input function: $R(0) = R(2) = 
q/(1 + 2q)$; $R(1) = 1/(1 + 2q)$; $0 \le q \le 1$, so that 

$$G(i) = \frac{q\,F(i) + F(i - 1) + q\,F(i - 2)}{1 + 2q}$$

This option is not recurrent but is also simple, since it stores only two previous values 
of $F$ at each step. For $q = 1$, the three adjacent values of the input function are summed 
up with equal weights 1/3 and the smoothing is the greatest; for $q = 0$ the smoothing 
is absent. 

The dispersion of independent input peaks decreases at the output 

• by $(1 - q)^2 / (1 - q^2)$ for the exponent (e.g., for $q = 0.5$ the decrease is 3 times); 

• by $(1 + 2q^2)/(1 + 2q)^2$ for the peak (e.g., for $q = 0.5$ the decrease is 2.67 times). 

At the same time, all slow components of $F(i)$ come through the filter unimpeded. 
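
Both smoothing options can be written down in a few lines. The sketch below assumes that values before the first word are zero, and the function names are ours. The dispersion factor of a linear filter for independent inputs is the sum of the squared filter weights, which is what the two helper functions return.

```python
def smooth_exponential(F, q=0.6):
    """Exponential filter: G(i) = q * G(i - 1) + (1 - q) * F(i), with G(0) = 0."""
    G, prev = [], 0.0
    for f in F:
        prev = q * prev + (1.0 - q) * f
        G.append(prev)
    return G


def smooth_peak(F, q=0.5):
    """Three-point peak: G(i) = (q F(i) + F(i - 1) + q F(i - 2)) / (1 + 2q)."""
    G = []
    for i in range(len(F)):
        f1 = F[i - 1] if i >= 1 else 0.0  # values before the text count as 0
        f2 = F[i - 2] if i >= 2 else 0.0
        G.append((q * F[i] + f1 + q * f2) / (1.0 + 2.0 * q))
    return G


def dispersion_factor_exponential(q):
    """Sum of squared weights (1 - q)^2 q^(2j) over j = 0, 1, ... in closed form."""
    return (1.0 - q) ** 2 / (1.0 - q * q)


def dispersion_factor_peak(q):
    """Sum of squared weights of the three-point peak."""
    return (1.0 + 2.0 * q * q) / (1.0 + 2.0 * q) ** 2


print(smooth_exponential([1.0, 1.0, 1.0], q=0.5))
print(smooth_peak([0.0, 3.0, 0.0, 0.0], q=1.0))
print(1.0 / dispersion_factor_exponential(0.5), 1.0 / dispersion_factor_peak(0.5))
```

The peak filter delays the curve by one word, since its largest weight falls on F(i - 1).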

The second source of randomness of the cohesion function is the inevitable incompleteness 
of any DB of collocations. For any given language, in order to collect the 
collocations covering an average text up to, say, 95%, it is necessary to scan through 
(automatically, with further manual control and post-editing) such a huge and polythematic 
corpus of text that the labor required is prohibitive. What is more, natural language is 
not static, and new candidates for stable collocations appear in texts continuously. 

In such a situation, it is more convenient to normalize the smoothed cohesion curve 
somehow. For this reason we propose to form the current mean value of the function 
$Q(i)$ given by formula (4). The mean value is calculated through the whole document 
beginning from $i = 1$ by the recurrent formula 

$$M(i) = \frac{i - 1}{i}\,M(i - 1) + \frac{1}{i}\,Q(i)$$

After passing several sentences, the current mean value experiences little fluctuation. 
Dividing $G(i)$ by $M(i)$, we obtain a normalized function that fluctuates around 1, 
with local maximums near the middle points of sentences and comparable minimums at 
their ends. Besides partially compensating for the DB incompleteness, the normalization 
decreases the arbitrariness of the selection of the functions U and V, especially with 
respect to the fertility of lexemes. 
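
A running mean of this kind needs only the previous mean and the word number. The sketch below is generic: the series to be averaged is passed in by the caller, and the guard against a zero mean at the start of the text is our addition.

```python
def running_mean(values):
    """Current mean M(i) = ((i - 1) / i) * M(i - 1) + values(i) / i for i = 1, 2, ..."""
    M, prev = [], 0.0
    for i, v in enumerate(values, start=1):
        prev = (i - 1) / i * prev + v / i
        M.append(prev)
    return M


def normalize(G, M, eps=1e-12):
    """Divide the smoothed curve by the current mean, position by position."""
    # Where the mean is still (almost) zero, no links have been seen yet.
    return [g / m if m > eps else 0.0 for g, m in zip(G, M)]


series = [2.0, 4.0, 6.0]
print(running_mean(series))
print(normalize(series, running_mean(series)))
```

The mean settles after a few sentences, so the ratio reflects local dips and peaks rather than the overall density of links found in the DB.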

5 Splitting Text into Paragraphs 

Now let us use the normalized cohesion function for splitting a text into paragraphs. We take into account the following considerations: 

• The next splitting point should be near a minimum of the normalized curve. 

• The selected local minimum should be lower than the recent minima that were not accepted as paragraph boundaries at the previous steps of the algorithm. 

• Usually an author unconsciously has in mind a mean paragraph length P. If the distance from the current point i to the initial point j of the given paragraph is considerably less than P, the current cohesion measure is not so important, but near P, any noticeable minimum implies the decision to end the paragraph. 

These requirements are met by continuously comparing G(i)/M(i) with the threshold 

    C(i, j) = C_0 + C_s \left( \frac{i - j}{P + \Delta} \right)^s, 

where C_0 ∈ [0.05 ... 0.2]; C_s = 1 - C_0; s ∈ [3 ... 5]; Δ ∈ [1 ... 3]. 

Fig. 1. Correlation between cohesion-driven functions. 



As soon as C(i, j) crosses G(i)/M(i) at a point i_0, the word before the nearest full stop to the left of i_0 is taken as the end of the current paragraph, and the splitting algorithm continues scanning the text. The relations between the functions F(i), M(i), G(i)/M(i), and the threshold value C(i, j) are illustrated in Figure 1. 
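
The scanning procedure can be sketched as follows. This is only an illustration under stated assumptions: the threshold is taken to be C(i, j) = C_0 + C_s ((i - j)/(P + Δ))^s with C_s = 1 - C_0, word positions are zero-based, and the names `threshold`, `split_points`, and `sentence_ends` are hypothetical.

```python
def threshold(i, j, P, C0=0.1, s=4, delta=3):
    """Variable threshold C(i, j); it grows as the paragraph approaches length P."""
    Cs = 1.0 - C0
    return C0 + Cs * ((i - j) / (P + delta)) ** s


def split_points(normalized, sentence_ends, P, **params):
    """Return the word indices at which paragraphs end.

    normalized    -- values of G(i)/M(i), one per word position
    sentence_ends -- sorted indices of words followed by a full stop
    """
    boundaries = []
    j = 0  # first word of the current paragraph
    for i, value in enumerate(normalized):
        if i > j and threshold(i, j, P, **params) >= value:
            # Take the nearest sentence end to the left of the crossing point.
            candidates = [e for e in sentence_ends if j <= e <= i]
            if candidates:
                boundaries.append(candidates[-1])
                j = candidates[-1] + 1
    return boundaries
```

With a short paragraph the threshold stays close to C_0, so only a deep minimum triggers a split; near the expected length P almost any dip does.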



6 A Simple Experiment 

For the simplest experiment with the proposed algorithm, we took an article from a Mexican newspaper with the following features: 997 words, 27 sentences, and 11 paragraphs selected by the author. All syntactic, pseudo-syntactic, and semantic links were marked manually beforehand. 

The algorithm was applied to the text stripped of paragraph boundaries, using the following parameters: α = 5/L, β = 2/L, δ = 0.75/L, A = B = 1, q = 0.6, C_0 = 0.1, s = 4, Δ = 3. The results were measured by recall and precision with respect to the boundaries selected by the author. 
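
As an illustration of the evaluation, the following sketch computes precision and recall for a set of proposed boundaries against reference boundaries. It assumes that a boundary counts as correct only when it coincides exactly with a reference boundary; the function name and the representation of boundaries as sentence indices are hypothetical.

```python
def precision_recall(proposed, reference):
    """Compare proposed paragraph boundaries with the reference ones.

    Both arguments are collections of boundary positions (e.g. sentence indices).
    """
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```

For example, proposing 9 boundaries of which 6 coincide with 10 reference boundaries gives a precision of 6/9 and a recall of 6/10.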

We also gave the same task to three experts. The results of all experiments are gathered in the following table: 





             Boundaries selected   recall   precision 
Algorithm             9             0.60      0.66 
Expert 1              9             0.50      0.66 
Expert 2              6             0.50      0.83 
Expert 3             13             0.80      0.61 


One can see that the algorithm restores paragraph boundaries no worse than educated native speakers of Spanish. The results are not conclusive but seem promising. 

7 Conclusion and Future Work 

A method of splitting text into paragraphs is proposed. It is based on the supposition that such splitting is implied by a measure of current text cohesion. The cohesion function is constructed on the basis of close co-occurrences of word pairs contained in a large database of collocations and semantic links. The computation includes several steps: estimation of the cohesion function, its smoothing, normalization, and comparison with a variable threshold depending on the expected paragraph length. Our preliminary experiments show promising results. 

Abstract. This paper describes an approach to the design of an information retrieval system that supports users' searches. Textual analysis is a part of information processing systems. The next generation of information systems will rely on collaborative agents that play a fundamental role in actively searching for and finding relevant information in complex systems. The explosive growth of Web sites and Usenet news demands effective filtering solutions. Access to digital data through WEB servers is facilitated by search engines. A number of Internet search engines provide classified search directories. The aim of the present paper is to suggest a method of filtering based only on URL addresses, titles, and abstracts. The problem of information searching in texts is mainly a linguistic problem. The objective is to construct a system for accessing and filtering information using the model of Noun Phrases (NP). Intensional predicates and NPs are used for retrieval, navigation (discrete and continuous), and filtering of the solutions captured from the WEB. 



1 Introduction 

Access to information through WEB servers is used very heavily by information seekers. Following a request formulated by means of a search engine, the user receives masses of WEB pages on the screen and needs tools that allow the information of all these pages to be filtered. With the widespread storage of information on the Web, it is becoming increasingly important to use automatic methods for filtering such information [1]. The goal is to propose a method of filtering based on URL addresses, titles, and abstracts. The selection of documents becomes very difficult because many of the obtained documents are not relevant. Generally, the user views the first pages but does not consult the hundreds that follow, so it is difficult to analyze the pertinence of the documents obtained. This step is a part of user profile modeling as a tool for accessing information. The filtering makes it possible to constitute a set of solutions within the framework of modeling the user's needs. The module uses classification algorithms to extract the more relevant ‘terms’ in titles and abstracts, given the texts accepted and rejected interactively by the user in the process of filtering. This filtering makes it possible to constitute a set of filtered solutions in order to improve the reformulation of the question. This approach favors textual analysis (a reflection of the information producer) in order to arrive at a representation of meaning [4], [7], [11]. 



2 Integrating Linguistic Resources for Information Retrieval 

The indexing of a document is a representation of the document that facilitates obtaining the information it contains. It is the passage from the textual document to an internal representation [2]. This study is based on linguistic techniques to optimize the following aspects: 

- Improvement of automatic indexing, based on the extraction of text references, in order to give a good representation of the content of the text. 

- Adequate analysis of the user's request in order to satisfy his informational needs. 

The problem of information searching in texts is mainly a linguistic problem. The objective is to construct a system of automatic indexing that uses the model of Noun Phrases (NP). 

2.1 Noun Phrase (NP): Referential Function and Indexing 

This representation has to capture the semantic characteristics of the document. It has been shown that the NP can be defined as a sequence of free predicates [8] constructed around a noun. The NP makes a direct reference to an extralinguistic element in a fixed universe, as in the following example. 

< 1 > <The policy> <NP>=<The + policy>=<quantifier + predicate> 

According to Le Guern, the NPs are the themes. Thus, it is possible to establish a correspondence between the NPs extracted from a text by a system and the descriptors that result from manual indexing [2]. The extraction of NPs is therefore decisive for optimizing automatic indexing [2], [8], [13]. The central predicate ‘policy’ is an intensional element. However, it is possible to consider it as an open intensional predicate in order to access the NP after its referential closure in documentary research [8]. 

2.2 Inclusion Relations in a NP 

The following example shows that some information is available on the predicates around the center of the syntagm, and also on the NPs that are included in other NPs. 

< 2 > <The policy of France> 

<The policy of <France>NP>NP 

The NP <France> is included in the NP_sentence: <The policy of France>NP 

This membership relation determines a number of levels. It has been shown that several inclusion levels can be defined with set theory [8]. Therefore, it is possible to assign an NP level 1 if it is simple, level 2 if it contains a simple NP, level 3 if it contains an NP of level 2, and so on. On the one hand, this process can in principle be extended to further levels; on the other hand, it seems to be limited in French [9]. In the framework of answer management, the information provided by the automatic system on the inclusions between NPs can be useful for oriented interrogations. The advantage of this viewpoint, which groups referential objects into textual sets, is that it illustrates the composition of intensional predicates. 
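
The level assignment can be made concrete with a small sketch. It assumes that each NP is stored together with the NPs directly included in it; the class `NP` and the function `level` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NP:
    text: str
    inner: List["NP"] = field(default_factory=list)  # NPs directly included in this NP


def level(np: NP) -> int:
    """Level 1 for a simple NP; otherwise one more than the deepest included NP."""
    if not np.inner:
        return 1
    return 1 + max(level(child) for child in np.inner)
```

For <The policy of <France>NP>NP, the inner NP <France> receives level 1 and the enclosing NP receives level 2.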

3 Hidden Information: Schema of Interrogation Based on Linguistic Relations 

Documentary research is the mode that seems to suit the user best. The user's questions in natural language will be interpreted by the information system in order to return the most relevant answers. In order to compare a question with the documents stored in the database, the request is analyzed according to the classical formalization so that its referential terms can be extracted. The extraction of content can therefore be carried out by logical representation. In this case, the solution provided to the user is the one which answers his request (and only this one) [4], [7], [11], [12]. 

The suggested interrogation schemas are based on a logical approach and were developed in previous work. A distinction was made between intensional free predicates and closed predicates (NP). This distinction makes it possible to analyze the interrogation problem according to whether these elements are intensional properties without reference to a fixed universe (the predicates are analyzed with intensional logic) or referential functions linked to a well-defined universe and carrying a truth value (the noun phrases are analyzed with classical logic) [8]. 

3.1 Left Expansion of Intensional Predicate 

The pertinence could be tied to the relations between the NPs. The information levels are found in the membership links between the words of a textual sequence, as shown in the following example: 

< 3 > <Les conditions de travail des salaries des entreprises de la capitale> 

< 3 > /The /conditions /of /work /of /the /workers /of /the /enterprises /of /the /capital/ 

[_Les conditions de travail _des salaries _des entreprises de la capitale] level 0 

NP1 [la capitale] level 1 
NP2 [les entreprises de la capitale] level 2 
NP3 [les salaries des entreprises de la capitale] level 3 
NP4 [Les conditions de travail des salaries des entreprises de la capitale] level 4 

We can see the following characteristics: 

- We call NP4 of level 4 [macro_NP_final]. This final NP4 is the NP that contains all the other NPs of lower level. 

- NP3 and NP2 are called [macro_NP] expansions of level 3 and level 2, respectively. 

- NP1 is called the [micro_NP] of level 1. 

- The intensional predicates are of level 0. 

3.2 Right Expansion of Intensional Predicate 

The pertinence could be tied to the relations between the NPs. The information levels are found in the membership links between the words of a textual sequence, as shown in the following example: 

< 4 > <The enterprises of the capital> 

[_les entreprises_de la_capitale] level 0 

NP1 [les entreprises] level 1 

NP2 [les entreprises de la capitale] level 2 

This gradual process of levels determines an inclusion relation between NPs. This relation reflects the links between the referential objects of the textual structure. 

3.3 Information Retrieval: Choice of Navigation Path 

- Information retrieval by intensional_predicate: The database has to provide the user, in priority, with all the NPs that contain the intensional predicate as the center of the NP. If these documents do not answer the needs of the user, it is possible to provide him with all the NPs of upper (greater) level which contain the intensional predicate as the center of the NP. In the case where this intensional predicate appears in the form of a complex word, the micro_NP which contains this intensional predicate is proposed first in order to avoid noisy solutions. 

- Information retrieval by micro_NP: The database must provide the user, in priority, with all the NPs that contain the micro_NP. If these documents do not meet the needs of the user, we can provide him with the NPs of upper level (macro_NP_end) that contain this micro_NP, or of lower level. 

- Information retrieval by left or right expansion of the intensional_predicate: The database must provide the user, in priority, with all the NPs that contain the intensional predicate as the center of the NP. If many documents meet the needs of the user, it is possible to select the NPs of lower level within this intensional predicate. This operation of information reduction can be continued by the user. 

- Information retrieval by macro_NP_end: The database must provide the user, in priority, with all the NPs that contain the macro_NP_end. If these documents do not meet the needs of the user, it is possible to select the NPs of lower level within this macro_NP, and so on. 
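
The first navigation path (retrieval by intensional predicate) can be sketched as below. This is a simplified illustration, not the system described above: it assumes an in-memory index in which every NP is stored with its central predicate and its inclusion level, and the names `IndexedNP` and `navigate_by_predicate` are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class IndexedNP:
    text: str    # the noun phrase itself
    center: str  # intensional predicate at the center of the NP
    level: int   # inclusion level (1 = micro_NP)


def navigate_by_predicate(index, predicate):
    """Yield groups of NPs centered on `predicate`, lowest level first.

    The caller shows the first group and asks for the next (upper) level
    only if the user is not satisfied.
    """
    hits = sorted((np for np in index if np.center == predicate),
                  key=lambda np: np.level)
    for _, group in groupby(hits, key=lambda np: np.level):
        yield list(group)
```

Returning the levels one at a time mirrors the stepwise navigation: the user moves to an upper level only when the current set of solutions does not answer the request.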

3.4 Schema of Filtering Navigation: Discrete and Continuous Navigation 

The different interrogation manners are summarized in Figure 1. 

The schema presents a set of solutions, even the noisiest ones, and leaves the user the choice of how to satisfy his demand. It permits the user to mark the solutions likely to meet his demand. In order to achieve this marking, the user has to be able to move within the structure produced by the different levels (discrete and continuous navigation). This is what we call the filtering of answers in cooperative/collaborative mode [3], [10]. To measure the importance of NP relations in the indexing of documents by search engines, the following natural-language questions are tested on the web. 

[Figure 1 labels: Optimization / Filtering; Outputs: Retrieved Documents; macro NP end; expansion (left, right) predicate] 

Fig. 1. Hidden Information Retrieval: path of continuous and discrete navigation 



4 Search Engines 

Search engines have been developed in order to look for information stored on the Web. Two types of robots are distinguished: index engines and descriptor engines: 

- Index engines cover all web servers and automatically enrich (enlarge) the directory by indexing the contents (titles, abstracts, texts). 

- Descriptor engines are based on titles or descriptions provided by the web designer. 

Among the search engines that combine the two techniques are Francite, WebCrawler, Excite, ... The results of a test on the WEB [5] are: 



Search engines       les  conditions      de  travail     des  salaries  entreprises      la  capitale 
Lokace            434312       60036  643208    96116  491567      7651        85240  530821     10396 
Francite               0        1314       0     3443       0       242         4524       0       663 
AV              12660474       77343                     1811    665938      2870612            183585 
Etc. 


The problem with interrogation in natural language is that it can generate non-pertinent information for the search. If the user submits a long textual sequence, such as a very long sentence, as the question, the system gives false information (HotBot gives 4306052 results for the quantifier ‘1^ ). 

AltaVista gives the following results: les (12660474); conditions de travail (77545); des (1811); salaries (665938); entreprises (2870612); la capitale (183585). 

This situation produces ambiguities because of the number of free predicates (the, of, each, in, stop words, etc.) that make up the request. We notice that the answers from search engines (Google, AltaVista, Yahoo, Nomade, ...) present a metadata structure. This structure consists of the following categories: titles, subtitles, abstracts, and URLs. 



5 Building Databases from Web-Data (Strategic Information) 

This modeling of the user's needs follows a preliminary process. One solution would be to present all answers (even the noisiest ones for intensional predicates) and to leave the user the choice of how to satisfy his demand. Another solution would be to determine the NPs of the question and to compare them with the NPs of the titles and abstracts in order to improve the filtering. The goal is to perform a syntactic analysis of the contents of the titles and downloaded abstracts and to represent the NP solutions in order to construct a filtered and indexed database for information retrieval. This viewpoint obliges the database to provide the user with all the NPs that answer his question. Marking the set of solutions in this way would make it possible to filter the information thanks to the existence of a mark. When the system produces different solutions, the user has to select the best solutions and/or call for automatic analysis by filtering agents [6]. To locate the information, strategies based on classification algorithms allow the filtering. It should be noted that access to the content of a document is not obtained by such methods. The general process will be completed by a linguistic filtering procedure: 

1. Lexical filtering of the intensional predicates (simple or complex) in the texts of the semi-structured database, 

2. Syntactic analysis of the texts of the database (only titles and abstracts), 

3. Classification of the NPs in order: in titles, in abstracts, and in the documents, 

4. The user consults the list of title NPs in priority, 

5. Presentation of the solutions in order, 

6. Building a database of the digital collection (strategic information) with the NPs and the expansions of NPs, 

7. Choice, evaluation, and access to a path, 

8. Future queries tested on the semi-structured database. 
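
Steps 3 to 5 of the procedure above can be illustrated with a minimal sketch. It assumes that the NPs of the question, the titles, and the abstracts have already been extracted by a syntactic analyzer (not shown), and that a match is a simple case-insensitive string equality; the function name and the dictionary keys are hypothetical.

```python
def rank_documents(question_nps, documents):
    """Order documents so that title matches come before abstract matches.

    documents -- list of dicts with keys "url", "title_nps", "abstract_nps"
    Documents whose title and abstract share no NP with the question are dropped.
    """
    query = {np.lower() for np in question_nps}
    scored = []
    for doc in documents:
        title_hits = len(query & {np.lower() for np in doc["title_nps"]})
        abstract_hits = len(query & {np.lower() for np in doc["abstract_nps"]})
        if title_hits or abstract_hits:
            scored.append((title_hits, abstract_hits, doc["url"]))
    # Sort by number of title hits, then abstract hits, both descending.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [url for _, _, url in scored]
```

Giving priority to title matches follows the order of consultation in step 4; a real system would also need partial matching between an NP and its expansions.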

In the information process, integration is a key element. The Internet contributes to information analysis in many ways, such as by navigation between sources through hyperlinks. Combining the search-engine approach with the more classical database approach in order to create a database, which we can call a ‘virtual collection’, means identifying the sources and indexing these databases for direct searching and direct use. 

6 Conclusions 

This study proposes two complementary methods for designing a documentary system. The first one emphasizes capturing textual data with a quantitative filtering algorithm based on a measure of implication between the question and the titles, abstracts, and documents. The second one adopts a method that focuses on the role of the user and on his knowledge in filtering the relevant answers from the Web. 




Building a Digital Collection of Web-Pages 






This approach allows the user to navigate and inspect the database documents captured according to his demand. Future search strategies operate on the semi-structured database (only URL addresses, titles, and abstracts) to look for relevant documents. The aim is to build a database of strategic information. 

In information retrieval, there is no correspondence between the set of references (NPs) that the user wants and the set of references that the system is going to suggest to him. To limit the noise/silence problem, we have to call on linguistic tools. We suggested some elements for studying the different interrogation schemas. The notions of intensional_predicate, micro_NP, macro_NP, and macro_NP_end are used to introduce the different navigation paths. Further research will be oriented toward combining linguistic techniques with filtering tools. 

Abstract. Before building a full WSD system, it is necessary to have a balanced and representative corpus annotated with sense tags. This requirement is certainly not fulfilled for the Czech language. Thus, we decided to develop some particular methods for annotating texts, and we have started with the most common nouns. In our approach, a disambiguation algorithm based on sets of words (called bags) is used. The advantage of this approach is the possibility of filling the bags in various ways. Our ultimate goal is to reduce manual work as much as possible. Here we present three basic ways of filling the bags. The first one is based on the machine-readable version of SSJC, the second takes advantage of learning from manually annotated text, and the third is the strategy of pseudo-clustering. 



1 Introduction 

Solving the word sense disambiguation (WSD) problem is an essential step towards progress in many branches of natural language processing, mainly in machine translation and information retrieval. The task of such disambiguation is to determine the correct meanings of words in texts. The basic scheme can be described in two steps. The first one is based on a summarization of senses, and the second one tries to associate the correct meanings with the word occurrences. 

Having a semantically well-annotated corpus is a starting point for creating a good WSD system, because such a corpus provides useful data for writing linguistically oriented and more sophisticated rules suitable for WSD. Such a corpus can also provide data for testing new methods of disambiguation. The problem is that for the Czech language there is no satisfactory, large, and representative corpus tagged with senses. We would eventually like to create such a corpus, but at the present stage of the research we are developing methods that lead to an easy and reliable semantic tagging of the most common nouns. In this article, we try to show some of these methods. We think that it is necessary to put stress on the reduction of manual work and, at this phase, to prefer precision to recall. If we achieve high precision in determining the senses, we will be able to use another strategy, e.g. bootstrapping, for tagging the rest of the text. 

2 Algorithm and Data 

We use a bag-of-words algorithm for three approaches to WSD. These approaches differ from one another in the techniques used for gaining and filling the bags of words. The semantically tagged corpus is used for testing the results. 

2.1 The Basic Principles of the Algorithm 

In our experiments we started from Lesk's approach [2], which uses machine-readable dictionaries. Nevertheless, we have changed this approach. The input of the algorithm consists of a text to be disambiguated and the sets of sensitive words (called bags) for each meaning of the target words. In contrast to Lesk, in two of our three experiments the bags are not filled entirely from dictionary definitions. The algorithm counts the overlaps between the bags and the context of a target word and declares the sense with the most overlaps the winner. In other words, the algorithm finds the bag that contains the most words occurring in the context. The advantage of this algorithm over other algorithms based on artificial intelligence techniques lies in the possibility of editing, examining, and changing these sensitive words. 
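
The core of the overlap counting can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the context is already lemmatized and stop-words are removed, every overlap counts equally, and the function name, the sense labels, and the `threshold` parameter are hypothetical.

```python
def disambiguate(context, bags, threshold=0):
    """Return the sense whose bag overlaps most with the context words.

    context -- list of (lemmatized) words around the target word
    bags    -- dict mapping a sense label to a set of sensitive words
    Returns None when no sense scores above the threshold.
    """
    context_words = set(context)
    best_sense, best_score = None, threshold
    for sense, bag in bags.items():
        score = len(context_words & bag)  # number of overlapping words
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

Because the bags are plain sets of words, they can be inspected and edited by hand, which is the advantage mentioned above.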

2.2 Multidimensional Space of Metrics 

The algorithm has many parameters, and their settings influence the results. Finding the optimum, i.e. the best combination of settings, is a hard problem without a definite solution, mainly because the optimum settings depend on the target word. Another difficulty is the large combinatorial complexity of finding the optimum. Many metrics are double-edged, because changing their parameters increases recall at the expense of precision and vice versa. We mention some of the most important metrics: 

- The length of the context. 

- The context window. (It reflects the fact that words occurring closer to the target are more important.) 

- The lemmatization of the context. (We did not apply lemma disambiguation this time, but we use the ajka analyzer for the lemmatization [5].) 

- The stop-list. 

- The threshold for declaring the winner. (We can require that the similarity score be greater than a given threshold.) 

- The bag. (We discuss it below.) 



2.3 Manual Tagging and Structure of Tags 

In our experiments we work with the corpus DESAM [4] developed at FI MU. It contains approximately 1 million positions and consists mostly of newspaper texts. We manually annotated target words selected from this corpus (approximately one thousand tags) using the corpus editor CED [6]. The most difficult problem in this respect was to identify the individual senses, because a quite fine-grained classification can be used. Typically, a more detailed differentiation of the senses makes the disambiguation more difficult, but this is not a rule. We were inspired by the sense differentiation in SSJC [9] and by the semantic network EuroWordNet [7]. This manual annotation gave us a basic overview of the sense distribution. The following example for the target srážka/y does not include all subtle senses: 

sense 1:   havarie (crash)            15 occurrences 
sense 2:   sarvatka (melee)           15 occurrences 
sense 3:   redukce (reduction)         6 occurrences 
sense 4:   pocasi (precipitation)     54 occurrences 

3 Gaining Sensitive Words 

The algorithm uses sets of words (bags) for the various senses. There are several ways of filling the bags; we have tested three of them. 



3.1 Filling Bags from SSJC 

We were able to work with the machine-readable version of SSJC [9]. It contains descriptions of the senses, and we used them for filling the bags. Each word form was lemmatized but not disambiguated; thus all lemma variants of a word were associated with it. Common words were eliminated by using a stop-list. This was performed automatically. Finally, we compared the program's tagging with the manually pretagged DESAM. 

We carried out some experiments for the words srážka/y (with 90 tokens), cena (212), západ (91), vazba (111), and fronta (73). The largest mistakes in the bags were corrected manually. A small change in a bag brings about a rapid change in the results. 



Table 1. The results of SSJC approach (precision/recall). 



the length of context:   5 positions   50 positions 
srazka                    91 / 39%      87 / 85% 
cena                      93 / 32%      87 / 80% 
zapad                     50 / 38%      52 / 80% 
vazba                     53 / 39%      56 / 72% 
fronta                    92 / 50%      65 / 82% 



It is surprising to observe such differences between the various targets. They are caused by the overly theoretical description of the senses in the dictionary. The algorithm did not operate well for the target words západ and cena. The targets srážka and fronta have their senses in different topic areas, which is the reason for their higher precision. The sense hodnota/worth is detected very badly for the target cena. 

We performed the experiments for a context length of 50 positions (words) as well. The precision dropped a little and recall increased considerably. The conclusion is to begin the disambiguation with a shorter context; the rest can then be disambiguated within a longer context. 

We also measured how the results change if we use other window metrics that favour the words closer to the target. We introduced three new metrics: parabolic (-d^2), hyperbolic (1/d), and pseudo-hyperbolic (1/√d), where d is the distance from the target word. Generally, the values of precision did not change much, except for the target srážka, whose precision dropped most with the hyperbolic metric, and the target fronta, whose precision increased with the same metric. Recall responded well: it improved for each of the three new metrics by 5 to 12%. 
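
The effect of a window metric can be illustrated with a sketch in which each overlapping word contributes a weight depending on its distance d from the target. Only the hyperbolic and pseudo-hyperbolic weights are shown, next to a flat window for comparison; the parabolic metric is omitted because its normalization is not stated here. The names `WINDOWS` and `weighted_overlap` are hypothetical.

```python
import math

# Weight of a context word as a function of its distance d >= 1 from the target.
WINDOWS = {
    "flat": lambda d: 1.0,
    "hyperbolic": lambda d: 1.0 / d,
    "pseudo_hyperbolic": lambda d: 1.0 / math.sqrt(d),
}


def weighted_overlap(words, target_index, bag, window="hyperbolic"):
    """Sum the window weights of the context words that occur in the bag."""
    weight = WINDOWS[window]
    score = 0.0
    for position, word in enumerate(words):
        d = abs(position - target_index)
        if d > 0 and word in bag:  # skip the target word itself
            score += weight(d)
    return score
```

Replacing the plain overlap count by such a weighted score leaves the rest of the bag-based algorithm unchanged.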

3.2 Simple Learning 

We tried a simple learning technique, made possible by the manual tagging of DESAM in the previous step. We used one part of the manually tagged DESAM for learning and another part for testing. The bags were created at the training stage; in contrast to the approach based on SSJC, each word in a bag is associated with a weight. There are several ways to choose the ratio of the number of training contexts to testing contexts. The following table contains the results for a ratio of one to two, which holds for each sense of the target word. 

Table 2. The results of simple learning (precision/recall). 



the length of context:   5 positions   20 positions 
srazka                    94 / 36%      79 / 64% 
fronta                    96 / 52%      82 / 77% 
vazba                     90 / 44%      87 / 86% 
zapad                     92 / 27%      67 / 86% 



We can see that the precision is higher than in the approach based on SSJC. Because of the low number of tokens and the newspaper nature of DESAM, we tested the disambiguation of the target word fronta in the bigger corpus ESO, which has 1890 occurrences of fronta. We introduced the new sense mladá fronta, which is a collocation and denotes the title of a newspaper. It is an example of how adding a new sense can improve the results. We achieved a precision of 95% and a recall of 45%, and these results do not differ much whether we use training data from the corpus DESAM or ESO, or whether we change the length of the context from 5 to 20 positions. The largest defects were found in the detection of the sense bojová/combat. 
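
A weighted bag can be learned from tagged contexts as in the following sketch. The weighting scheme is an assumption made for illustration (the weight of a word is simply the number of training contexts of the given sense in which it was seen); the text does not specify how the weights were actually computed, and the function names and sense labels are hypothetical.

```python
from collections import Counter, defaultdict


def train_bags(tagged_contexts, stop_list=frozenset()):
    """Build one weighted bag per sense from (sense, context words) pairs."""
    bags = defaultdict(Counter)
    for sense, context in tagged_contexts:
        for word in set(context):  # count each word once per context
            if word not in stop_list:
                bags[sense][word] += 1
    return bags


def classify(context, bags):
    """Pick the sense with the highest sum of weights; None if nothing matches."""
    scores = {sense: sum(bag[w] for w in set(context)) for sense, bag in bags.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

Returning None when no known word occurs in the context keeps precision high at the cost of recall, in line with the preference stated in the introduction.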

3.3 Pseudo-clustering 

In the manual tagging, there were problems associated with the differentiation of senses. Thus we decided to try out some clustering techniques which would distinguish the senses automatically. The suggested algorithm finds the most frequent words in the contexts, links them to each other, and creates a matrix of word bigrams. Each cell of the matrix contains the number of contexts in which both words occur. Then the bigrams are taken from the most to the least frequent: if neither of the words has been selected yet, they create a new cluster; if only one of the words has not been selected, it is linked to the cluster of the other word; and if both words have already been selected, the bigram is ignored. The words are nodes of a graph and the bigrams are its edges. The algorithm is similar to the construction of a spanning tree, but it joins only nodes that have not been selected yet. 



Table 3. The bigram matrix for target fronta. 



fronta              mlady  narodni  vydat  strana  dnes  cekat  dlouhy 
mlady/young           -       4      82      29     80     1       3 
narodni/national      -       -       5      81      3     0       0 
vydat/bring out       -       -       -      18     12     1       0 
strana/party          -       -       -       -     31     2       1 
dnes/today            -       -       -       -      -     5       3 
cekat/wait            -       -       -       -      -     -      26 
dlouhy/long           -       -       -       -      -     -       - 


The most frequent pair is mlady-vydat, and it creates cluster No. 1 in this example. The pair národní-strana creates cluster No. 2; dnes is added to cluster No. 1 because of the pair mlady-dnes, etc. Then the pair strana-dnes leads to a conflict and is ignored, because strana is contained in cluster No. 2 while dnes belongs to cluster No. 1. This is caused by the ambiguity of the word strana/party, page. As a result, we get the following clusters: 

- mlady, vydat, dnes 

- národní, strana 

- cekat, dlouhy 
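
The pseudo-clustering procedure can be written down as a short sketch and checked against the example for the target fronta. The input format (a dictionary of pair counts) and the function name are illustrative assumptions; pairs with a zero count are skipped, and ties between equally frequent pairs are resolved by input order.

```python
def pseudo_cluster(pair_counts):
    """Group words into clusters from co-occurrence counts of word pairs.

    pair_counts -- dict mapping (word_a, word_b) to the number of contexts
                   in which both words occur
    """
    cluster_of = {}  # word -> index of its cluster
    clusters = []    # list of clusters, each a list of words
    ranked = sorted(pair_counts.items(), key=lambda item: -item[1])
    for (a, b), count in ranked:
        if count == 0:
            continue
        a_seen, b_seen = a in cluster_of, b in cluster_of
        if not a_seen and not b_seen:
            # Neither word selected yet: the pair starts a new cluster.
            cluster_of[a] = cluster_of[b] = len(clusters)
            clusters.append([a, b])
        elif a_seen != b_seen:
            # Exactly one word selected: attach the other to its cluster.
            old, new = (a, b) if a_seen else (b, a)
            cluster_of[new] = cluster_of[old]
            clusters[cluster_of[old]].append(new)
        # Both words already selected: the pair is ignored.
    return clusters


# Non-zero pair counts for the target "fronta" (ASCII spellings).
FRONTA = {
    ("mlady", "narodni"): 4, ("mlady", "vydat"): 82, ("mlady", "strana"): 29,
    ("mlady", "dnes"): 80, ("mlady", "cekat"): 1, ("mlady", "dlouhy"): 3,
    ("narodni", "vydat"): 5, ("narodni", "strana"): 81, ("narodni", "dnes"): 3,
    ("vydat", "strana"): 18, ("vydat", "dnes"): 12, ("vydat", "cekat"): 1,
    ("strana", "dnes"): 31, ("strana", "cekat"): 2, ("strana", "dlouhy"): 1,
    ("dnes", "cekat"): 5, ("dnes", "dlouhy"): 3, ("cekat", "dlouhy"): 26,
}
```

Calling `pseudo_cluster(FRONTA)` yields the three clusters listed above; the conflicting pair strana-dnes is ignored because both words are already assigned when it is reached.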

We carried out some further experiments using this technique. We tried out various numbers of the most frequent words and various context lengths. The word clusters obtained were slightly modified by hand and used as the bags for our algorithm. The precision was about 90% at that point, but better results can be expected if a word is added to a cluster only when it meets a specific condition (exceeding a given threshold). The advantages of this method are that the senses are extracted automatically and that they are pragmatically oriented towards a given corpus. A small disadvantage is that the manual elimination of some clusters proves to be necessary. 

4 Conclusions and Future Work 

In this paper we have described the initial phase of creating a word sense 
disambiguator for Czech based on the bag-of-words approach. We evaluated the 
multidimensional space of metrics whose settings play a role in the results of the 
disambiguation. We described our three methods of filling the bags and 
tested them on a selected group of Czech nouns. 

The first approach leads to the conclusion that electronic versions of 
dictionaries, such as SSJC, are not quite appropriate because they are too 
theoretical and in some cases they prefer the description of senses to examples. 
Primitive learning from a manually annotated corpus offered satisfactory results 
in the second method, but it cannot be regarded as an acceptable solution because 
of the great deal of manual work. Finally, our pseudo-clustering algorithm 
suggested a division into senses and reduced the manual work. We are 
convinced that its results can be improved. 

The next step is to improve the pseudo-clustering and to try to 
combine our three approaches in a hybrid disambiguator. After that we would like 
to use bootstrapping techniques and try to tag a corpus for more target words. 
This corpus would be used in further research. One of our ultimate goals is to 
develop a tool that could process raw corpus texts with acceptable reliability. 

References 

1. Ide, N., Veronis, J.: Introduction to the Special Issue on Word Sense 
Disambiguation: The State of the Art, Computational Linguistics, Vol. 24, Num. 1, 1998. 

2. Lesk, M.: Automatic Sense Disambiguation Using Machine Readable Dictionaries: 
How to Tell a Pine Cone from an Ice Cream Cone. Proceedings of SIGDOC, Toronto, 
1986, pp. 1-9. 

3. Pala, K.: Word Senses and Semantic Representations. Can We Have Both?, Text, 
Speech and Dialogue: Proceedings of TSD'00 Workshop, LNAI 1902, Springer, 2000, 
pp. 109-114. 

4. Pala, K., Rychly, P., Smrz, P.: DESAM - An Annotated Corpus for Czech, 
Proceedings of SOFSEM'98, Springer, 1998. 

5. Sedlacek, R., Smrz, P.: Automatic Processing of Czech Inflectional and Derivative 
Morphology, Technical Report, Faculty of Informatics, Masaryk University, Brno, 
2001. 

6. Veber, M.: CED - Program for Corpora Editing, Technical Report, Faculty of 
Informatics, Masaryk University, Brno, 1999. 

7. Vossen, P., et al.: Set of Common Base Concepts in EuroWordNet-2, Final Report, 
2D001, Amsterdam, October 1988. 

8. Wilks, Y., Stevenson, M.: Sense Tagging: Semantic Tagging with a Lexicon, 
Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What 
and How?, Washington, D.C., 1997. 

9. Slovnik spisovneho jazyka ceskeho (Dictionary of literary Czech), Akademia, Praha, 
1960, electronic version, Praha, Brno, 2000. 




Method for WordNet Enrichment Using WSD 



Andres Montoyo¹, Manuel Palomar¹, and German Rigau² 

¹ Department of Software and Computing Systems, 
University of Alicante, Alicante, Spain 
{montoyo, mpalomar}@dlsi.ua.es 
² Departament de Llenguatges i Sistemes Informàtics, 
Universitat Politècnica de Catalunya, 08028 Barcelona, Spain 
g.rigau@lsi.upc.es 



Abstract. This paper presents a new method to semantically enrich 
WordNet with categories from general domain classification systems. The 
method is performed in two consecutive steps. First, a word sense 
disambiguation process based on lexical knowledge is applied. Second, a set 
of rules selects the main concepts as representatives for each category. The 
method has been applied to automatically label WordNet synsets with Subject 
Codes from a standard news agencies classification system. Experimental 
results show that the proposed method achieves more than 95% accuracy 
selecting the main concepts for each category. 



1 Introduction and Motivation 

Many researchers have proposed several techniques for taking advantage of more 
than one lexical resource, that is, integrating several structured lexical resources 
from pre-existing sources. 

Byrd in [3] proposes the integration of several structured lexical knowledge 
resources derived from monolingual and bilingual Machine Readable Dictionaries 
(MRD) and Thesauri. The work reported in [19] used a mapping process between 
two thesauri and two sides of a bilingual dictionary. Knight in [7] provides 
definition match and hierarchical match algorithms for linking WordNet [9] synsets 
and LDOCE [15] definitions. Knight and Luk in [8] describe the algorithms for 
merging complementary structured lexical resources from WordNet, LDOCE and 
a Spanish/English bilingual dictionary. A semiautomatic environment for linking 
DGILE [2] and LDOCE taxonomies using a bilingual dictionary is described 
in [1]. A semi-automatic method for associating Japanese entries to an English 
ontology using a Japanese/English bilingual dictionary is described in [13]. An 
automatic method to semantically enrich the monolingual Spanish dictionary 
DGILE, using a Spanish/English bilingual dictionary and WordNet, is described 
in [16]. Several methods for linking Spanish and French words from bilingual 
dictionaries to WordNet synsets are described in [17]. A mechanism for linking 
LDOCE and DGILE taxonomies using a Spanish/English bilingual dictionary 
and the notion of Conceptual Distance between concepts is described in [18]. 
The work reported in [4] used LDOCE and Roget’s Thesaurus to label LDOCE. 
A robust approach for linking already existing lexical/semantic hierarchies, in 
particular WordNet 1.5 onto WordNet 1.6, is described in [5]. 

This paper presents a new method to enrich WordNet with domain labels 
using a knowledge-based Word Sense Disambiguation (WSD) system and a set of 
knowledge rules to select the main concepts of the sub-hierarchies to be labelled. 
The WSD system used is the Specification Marks method [11]. 

The organisation of this paper is as follows: After this introduction, in Section 
2 we describe the technique used (Word Sense Disambiguation (WSD) using the 
Specification Marks Method) and its application. In Section 3 we describe the 
rules used in the method for labelling the noun taxonomy of WordNet. In 
Section 4, some experiments related to the proposed method are presented, and 
finally, conclusions and an outline of further lines of research are given. 

2 Specification Marks Method 

WSD with Specification Marks is a method for the automatic resolution of 
lexical ambiguity of groups of words whose different possible senses are related. 
The disambiguation is resolved with the use of the WordNet lexical knowledge base 
(1.6). The method requires the knowledge of how many of the words are grouped 
around a specification mark, which is similar to a semantic class in the 
WordNet taxonomy. The word sense in the sub-hierarchy that contains the greatest 
number of words for the corresponding specification mark will be chosen for the 
sense disambiguation of a noun in a given group of words. We should like to 
point out that after having evaluated the method, we subsequently discovered 
that it could be improved with a set of heuristics, providing even better results in 
disambiguation. A detailed explanation of the method can be found in [12], while 
its application to NLP tasks is addressed in [14]. 
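
The counting idea behind the method can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the procedure of [12]: the taxonomy below is a tiny hypothetical stand-in for WordNet, the sense names are invented, the heuristics mentioned above are omitted, and ties between senses are resolved by descending one level at a time from the root.

```python
# Hypothetical toy taxonomy: node -> hypernym (None marks the root).
PARENT = {
    "entity": None,
    "artifact": "entity", "organism": "entity", "abstraction": "entity",
    "plant_factory": "artifact", "leaf_sheet": "artifact",
    "plant_flora": "organism", "leaf_foliage": "organism",
    "tree_woody": "plant_flora",
    "tree_diagram": "abstraction",
}

# Hypothetical sense inventory: word -> candidate senses.
SENSES = {
    "plant": ["plant_factory", "plant_flora"],
    "tree": ["tree_woody", "tree_diagram"],
    "leaf": ["leaf_foliage", "leaf_sheet"],
}


def chain(node):
    """Hypernym chain of a node, root first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def words_under(mark, words):
    """Number of context words with at least one sense at or below the mark."""
    return sum(any(mark in chain(s) for s in SENSES[w]) for w in words)


def disambiguate(word, context):
    """Return the surviving senses of `word` (a single sense when resolved)."""
    candidates = list(SENSES[word])
    depth = 0
    while len(candidates) > 1:
        scores = {}
        for sense in candidates:
            path = chain(sense)
            mark = path[min(depth, len(path) - 1)]  # specification mark at this level
            scores[sense] = words_under(mark, context)
        best = max(scores.values())
        tied = [s for s in candidates if scores[s] == best]
        if all(depth >= len(chain(s)) - 1 for s in tied):
            return tied  # no deeper level left to separate the senses
        candidates = tied
        depth += 1
    return candidates


context = ["plant", "tree", "leaf"]
print({w: disambiguate(w, context) for w in context})
```

In this toy setting the mark `organism` covers all three context words while `artifact` covers only two, so the botanical senses are selected for every word.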

3 Proposal for WordNet Enrichment 

Classification systems provide a means of arranging information so that 
it can be easily located within a library, the World Wide Web, newspapers, etc. 
Materials are usually classified by their category or class. Therefore, the field 
of human knowledge is divided into major categories, these are divided into 
subsections, and so on. The classification scheme is structured according to the 
state of current human knowledge. On the other hand, WordNet presents word 
senses that are too fine-grained for NLP tasks. We define a way to deal with this 
problem, describing an automatic method to semantically enrich WordNet 1.6 
with categories or classes from the classification systems using the Specification 
Marks Method. Categories, such as Agriculture, Health, etc., provide a natural 
way to establish semantic relations among word senses. 

3.1 Method 

In this section we describe, in detail, the method employed for enriching WordNet 
1.6. The groups of words pertaining to a category, that is, the words to be 
disambiguated, come from different files of the classification systems. These 
groups of nouns are the input for the WSD module. This module consults the 
WordNet knowledge base for all words that appear in the semantic category, 
returning all of their possible senses. The disambiguation algorithm is then 
applied and a new file is returned, in which the words have the correct sense as 
assigned by WordNet. This new file is the input for the rules module, which 
applies a set of rules for finding the super-concept in WordNet. This 
super-concept in WordNet is labelled with its corresponding 
category of the classification system. This process is illustrated in Figure 1. 




















Fig. 1. Process of WordNet enrichment 



The method performs the following steps to enrich and label WordNet. 

Step 1. Start with the categories of the classification systems and clear up 
 any ambiguities at this stage. Some entries in the categories consist of two 
 or more words. Such word combinations may not be in WordNet, in which case 
 it would be impossible to disambiguate them. To resolve this problem we use 
 the WordNet utility “Find Keywords by Substring” (grep). The string found is 
 a synset in WordNet and relates to the words of the category (e.g., the 
 string “Health organization” isn’t in WordNet, but searching with this 
 utility yields “Health maintenance organization”). 

Step 2. Locate the synset or sense number associated with each of the 
 words of the category, using the Specification Marks Method. 

Step 3. Obtain the super-concept for each category, using the hyper/hyponym 
 relationships in the taxonomy of WordNet. For example, the super-concept 
 for disease is ill_health. 

Step 4. Label the super-concept obtained in WordNet with the category 
 belonging to the group of words in the classification systems. For example, 
 the super-concept obtained in Step 3 is labelled with Health. 
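
The lookup in Step 1 can be approximated as follows. This is a hedged sketch, not the WordNet utility itself: it assumes that a multi-word category entry matches a lemma when all of its words occur in the lemma in the same order, and the lemma list `LEMMAS` is a small hypothetical sample rather than the real WordNet index.

```python
# Hypothetical sample of lemma strings (underscores separate words).
LEMMAS = [
    "health",
    "organization",
    "maintenance",
    "health_care",
    "health_maintenance_organization",
]


def find_keywords(phrase, lemmas):
    """Return the lemmas that contain every word of `phrase`, in order."""
    wanted = phrase.lower().split()
    hits = []
    for lemma in lemmas:
        position = 0  # index of the next phrase word still to be matched
        for token in lemma.lower().split("_"):
            if position < len(wanted) and token == wanted[position]:
                position += 1
        if position == len(wanted):
            hits.append(lemma)
    return hits


print(find_keywords("Health organization", LEMMAS))
```

With this sample, the entry "Health organization" is mapped to `health_maintenance_organization`, which mirrors the example given in Step 1.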



3.2 Super-concepts Rules 

The way to combine the semantic categories of classification systems and 
WordNet is to obtain the super-concept of WordNet for each group of words 
that belong to a semantic category. To obtain these super-concepts we apply 
the following set of rules. 

Rule 1. If a synset contains only hyponym words belonging to the category 
 being disambiguated, it is chosen as the super-concept. The category is 
 assigned to that super-concept as well as to all its hyponyms and meronyms. 
 For example, the category Health is made up of a group of words including 
 clinic and hospital. 

Rule 2. If the synset selected has a hypernym that is made up of the same 
 word as the chosen entry, it is selected as the super-concept. The category 
 is assigned to that super-concept as well as to all its hyponyms and 
 meronyms. For example, the synset ill_health is made up of ill and health 
 and therefore it is a hypernym of disease#1. 

Rule 3. This rule resolves the problem of those words that are neither directly 
 related in WordNet nor appear in some compound synset of a hyper/hyponym 
 relationship. We use the gloss of each synset of the hyponym relationship. 
 The hypernym of the disambiguated word is obtained in the taxonomy of 
 WordNet. Then, all of the other words included in the category are looked 
 for in the glosses of the immediate hyponym synsets, and the label of the 
 category is assigned to each synset whose gloss contains one of them. This 
 category label is also assigned to all its hyponyms and meronyms. 

Rule 4. When the word to be disambiguated is next to the root level, that 
 is, at the top of the taxonomy, this rule assigns the category to the synset 
 and to all its hyponyms and meronyms. For example, the category Health is 
assigned to injury#^- 
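
As an illustration, Rule 1 can be sketched on a toy hierarchy. The assumptions are stated plainly: `HYPONYMS` is a hypothetical fragment rather than WordNet data, meronyms are left out for brevity, and "contains only hyponym words belonging to the category" is read as "every direct hyponym is one of the disambiguated category senses".

```python
# Hypothetical fragment: synset -> direct hyponyms.
HYPONYMS = {
    "building": ["medical_building", "school"],
    "medical_building": ["clinic", "hospital"],
    "hospital": ["military_hospital"],
}


def descendants(synset):
    """All hyponyms of a synset, at any depth."""
    found = []
    for child in HYPONYMS.get(synset, []):
        found.append(child)
        found.extend(descendants(child))
    return found


def apply_rule_1(category, category_senses):
    """Label each synset whose direct hyponyms all belong to the category,
    together with all of its hyponyms."""
    labels = {}
    for synset, children in HYPONYMS.items():
        if children and all(child in category_senses for child in children):
            for labelled in [synset] + descendants(synset):
                labels[labelled] = category
    return labels


print(apply_rule_1("Health", {"clinic", "hospital"}))
```

Here `medical_building` becomes the super-concept for Health because both of its hyponyms are category words; the label then propagates down to `military_hospital`, while `building` and `school` stay unlabelled.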

4 Discussion 

The goal of the experiments is to assess the effectiveness of the proposed method 
to semantically enrich WordNet 1.6 with categories from IPTC. Table 1 presents 
some IPTC categories with the different test sets, computed as the number of 
WordNet synsets correctly labelled, synsets incorrectly labelled and words 
unlabelled (their synsets are not in WordNet). 

To evaluate the precision, coverage and recall of the method, we applied the 
rules of Section 3.2 and hand-checked the results for each word belonging 
to an IPTC category. 

Table 1. IPTC categories with the different test sets 

Categories IPTC                 Total Number  Correctly  Incorrectly  Words
                                Words IPTC    Labelled   Labelled     Unlabelled
                                              Synsets    Synsets
Arts, culture & entertainment        23           21          2           0
Disasters & accidents                10            7          2           1
Agriculture                           6            5          0           1
Chemical                              9            8          0           1
Computing & Technology               10            9          1           0
Construction & property               5            3          1           1
Energy & resource                    14           10          0           4
Financial & business services        13           12          1           0
Consumer goods                       10           10          0           0
Media                                12           12          0           0
Tourism & leisure                     7            7          0           0
...
Health                               12            8          3           1
TOTAL                               399          358         16          25



Precision is given by the ratio between the number of correctly labelled synsets 
and the total number of answered (correctly and incorrectly labelled) synsets. 
Coverage is given by the ratio between the total number of answered synsets and 
the total number of words. Recall is given by the ratio between the number of 
correctly labelled synsets and the total number of words. The experimental 
results are shown in the following table. 



%                     Coverage   Precision   Recall
WordNet Enrichment     93.7%      95.7%      89.8%
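
The three measures are simple ratios and can be recomputed from the totals of Table 1 (358 correctly labelled synsets, 16 incorrectly labelled synsets and 25 unlabelled words, 399 words in all). The function name `evaluate` is an illustrative choice.

```python
def evaluate(correct, incorrect, unlabelled):
    """Precision, coverage and recall as defined in the text."""
    answered = correct + incorrect          # synsets that received a label
    total_words = answered + unlabelled     # every word of the test sets
    return {
        "precision": correct / answered,
        "coverage": answered / total_words,
        "recall": correct / total_words,
    }


scores = evaluate(correct=358, incorrect=16, unlabelled=25)
for name, value in scores.items():
    print(f"{name}: {100 * value:.1f}%")
```

The totals reproduce the reported coverage and precision; the recall computed this way differs from the reported figure only in the last decimal place.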



We saw that if the Specification Marks Method disambiguates correctly and 
the rules of Section 3.2 are applied, the method works successfully. 
However, if the Specification Marks Method disambiguates incorrectly, the 
labelling of WordNet with categories of IPTC is also done incorrectly. 

5 Conclusions and Future Work 

Several works in the literature [6] have shown that for many NLP tasks the 
fine-grained sense distinctions provided by WordNet are not necessary. We propose 
a way to deal with this problem, describing an automatic method to semantically 
enrich WordNet with categories or classes from the classification systems 
using the Specification Marks Method. Categories, such as AGRICULTURE, 
HEALTH, etc., provide a natural way to establish semantic relations among word 
senses. 

This paper applies the WSD Specification Marks Method to assign a 
category of a classification system to a WordNet synset as well as to all its 
hyponyms and meronyms. We enrich the WordNet taxonomy with categories of the 
classification system. 

The experimental results, when the method is applied to the IPTC Subject 
Reference System, indicate that this may be an accurate and effective method to 
enrich the WordNet taxonomy. 

We have seen in these experiments a number of suggestive indicators. The 
WSD Specification Marks Method works successfully with classification systems, 
that is, categories subdivided into groups of words that are strongly related. 
Although this method has been tested on the IPTC Subject Reference System, it 
can also be applied to other systems that group words around a single category, 
such as the Library of Congress Classification (LC), Roget’s Thesaurus or the 
Dewey Decimal Classification (DDC). 

A relevant consequence of the application of the method to enrich WordNet 
is the reduction of word polysemy (i.e., the number of categories for a word 
is generally lower than the number of senses for the word). That is, category 
labels (e.g., Health, Sports, etc.) provide a way to establish semantic relations 
among word senses, grouping them into clusters. 

Furthermore, we are now able to build variants of WSD systems that use 
domain labels rather than synset labels [10]. 



Abstract. We show that data-guided techniques optimized for classification of 
speech sounds into context-independent phoneme classes yield auditory-like 
frequency resolution and enhanced sensitivity to modulation frequencies in the 
1-15 Hz range. Next we present a viable recognition paradigm in which temporal 
trajectories of spectral energies in individual critical bands are used 
to yield estimates of likelihood of phoneme classes. The relative success of this 
technique leads to a discussion about the auditory basis of the human speech 
communication process. Overall, we argue against a spectral envelope based 
linguistic code in communication by speech. 



1 General Introduction 

Not all researchers in automatic speech recognition (ASR) agree that knowledge 
of the human speech communication process could be helpful in designing better ASR 
systems. Some properties of human auditory perception, such as non-uniform frequency 
resolution, are well accepted and found useful in ASR. More complex models of hearing 
are still viewed by the ASR field with caution. Cognitive scientists may sometimes be 
interested, but often perhaps annoyed or amused, by the ways ASR is currently done. 
Both disciplines, ASR and the science of human speech communication, are still 
evolving and it is likely that mutual collaboration could be beneficial to both. ASR 
tries to emulate human cognitive processes and it makes little sense for ASR to ignore 
cognitive science. The fact that ASR at least partially succeeds in recognizing the 
linguistic message in speech should be of interest to cognitive scientists. The author 
of this paper is an ASR engineer. The work presents some ASR results that could 
hopefully also be of interest to researchers in human speech communication. 

The paper is organized as follows: First, we describe experiments that attempt to 
optimize feature extraction for ASR. This work yields speech feature extraction methods 
that are consistent with some properties of human hearing and shows that the information 
required for classification of phonemes is distributed over a relatively long time 
interval in the time-frequency plane. Next, we present an ASR system based on 
classification from temporal patterns of spectral energies in individual critical 
bands. The relative success of this approach allows for questioning the 
information-bearing role of spectral envelopes of speech and emphasizes the role of 
temporal dynamics in individual frequency bands. 

2 Data-Guided Features 

2.1 Introduction to the Problem 

The core of acoustic processing in ASR is pattern classification of the incoming speech 
signal. A classical pattern classification system consists of two modules: 1) feature 
extraction, 2) pattern classifier. The pattern classifier is typically trained on some 
training data; the feature extraction module is typically designed based on beliefs and 
intuitions about what is important in the speech signal. Data-guided feature extraction 
attempts to replace the beliefs and intuitions of the designer by knowledge derived from 
large amounts of labeled speech data. A general strategy in the data-guided design of 
features for classification is to use these large hand-labeled databases to derive a 
processing of the short-term power spectrum of speech that improves classification. 
Linear Discriminant Analysis (LDA) is used for the optimization. 
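
To make the role of LDA concrete, the following sketch computes a discriminant direction for the simplest possible case. It is an illustration under explicit assumptions, not the analysis used in the experiments: only two classes are considered (Fisher's discriminant, the two-class form of LDA), the "spectral vectors" are tiny hand-made lists with three bands, and the helper names `solve` and `fisher_direction` are invented for this example.

```python
import math


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - rest) / aug[r][r]
    return x


def fisher_direction(class_a, class_b):
    """Unit vector w proportional to Sw^-1 (mean_b - mean_a)."""
    dim = len(class_a[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    mean_a, mean_b = mean(class_a), mean(class_b)
    # Within-class scatter matrix Sw, accumulated over both classes.
    scatter = [[0.0] * dim for _ in range(dim)]
    for rows, centre in ((class_a, mean_a), (class_b, mean_b)):
        for r in rows:
            dev = [r[j] - centre[j] for j in range(dim)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += dev[i] * dev[j]
    w = solve(scatter, [mean_b[j] - mean_a[j] for j in range(dim)])
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]


# Toy data: the two classes differ only in the energy of band 0.
CLASS_A = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
CLASS_B = [[5, 0, 0], [4, 1, 0], [4, 0, 1], [5, 1, 1]]
print(fisher_direction(CLASS_A, CLASS_B))
```

On this toy data the discriminant points along band 0, the only band that separates the classes. With many phoneme classes the same criterion leads to a generalized eigenvalue problem on the between-class and within-class scatter matrices, whose leading eigenvectors play the role of the discriminants.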

Why do we start with the short-term spectrum of speech when, in principle, the right 
way would be to start with the speech signal, since the speech signal is the input to 
the human speech decoding system? The problem we are facing here is that we know quite 
well that the human hearing system is sensitive to the energy of the signal, i.e., there 
is a need for a rectifying nonlinearity. At this moment we do not know how to approach 
the non-linear optimization and how to interpret its results, so we include the 
rectification of the spectrum (i.e., computing the power spectrum of speech after 
carrying out the spectral analysis) in the initial processing. 

Why are we attempting to classify speech sounds into context-independent phoneme 
classes when most advanced ASR systems would use phonemes-in-context as the 
target classes? Early experiments of Fletcher [4] (reviewed in [1]) indicate that human 
listeners are capable of recognizing phonemes in nonsense syllables independently of 
their context. We take this as evidence that human auditory perception is capable of 
compensating for coarticulation effects that are clearly evident in acoustic speech data 
and which create problems in spectral envelope-based ASR. 

2.2 Two Ways of Applying LDA 

As shown in Figure 1, depending on the way we form the spectral vectors for the LDA 
optimization, the LDA could yield either an optimized spectral basis for the projection 
of the short-term power spectrum onto a space optimized for the classification, or FIR 
filters for filtering time trajectories of spectral energies in order to improve the 
classification. 



LDA derived spectral basis. The data-derived spectral basis (LDA matrix for optimal 
rotation of the short-term log spectrum) suggests Bark-like spectral warping and 
relatively large spectral integration of the order of an octave [11]. More details are 
in [12]. 



[Figure 1: two panels over a phoneme-labeled time-frequency pattern of speech. 
Upper panel: LDA gives a basis for projection of spectral space. Lower panel: LDA 
gives FIR filters for filtering time trajectories of short-term spectral energies.] 

Fig. 1. Depending on the way of forming the initial spectral vectors for the LDA 
optimization, the LDA may yield either a spectral basis or FIR filters. 



LDA derived temporal RASTA filters. Data-derived temporal RASTA [6] FIR filters 
(LDA matrix for optimal rotation of temporal vectors of critical-band energies) suggest 
the need for attenuating spectral envelope changes below 1 Hz and above 15 Hz [15,8]. 
This is achieved by RASTA FIR filters with impulse responses that resemble a 
Mexican-hat-like (difference of two Gaussians) function and its temporal derivatives. 
The filter impulse responses span relatively large chunks of signal (of the order of 
0.5 s or more), suggesting that the optimal classification of the speech signal into 
phonetic classes requires speech segments that are longer than the average length of a 
phoneme (which is about 70 ms in our database). More details can be found in [16]. 
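
The band-pass character of such a filter can be checked with a small sketch. Note the assumptions: the filters in the paper are derived from data by LDA, whereas the taps below are designed analytically as a difference of two Gaussians purely for illustration; the 100 Hz frame rate (10 ms frames), the 101-tap length (about 1 s) and the two Gaussian widths are invented values.

```python
import math


def gaussian(t, sigma):
    """Unit-area Gaussian evaluated at time t (seconds)."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mexican_hat_taps(n_taps=101, frame_rate_hz=100.0,
                     sigma_narrow_s=0.02, sigma_wide_s=0.08):
    """FIR taps shaped as a difference of two Gaussians, with zero DC gain."""
    half = n_taps // 2
    taps = [
        gaussian(n / frame_rate_hz, sigma_narrow_s)
        - gaussian(n / frame_rate_hz, sigma_wide_s)
        for n in range(-half, half + 1)
    ]
    mean = sum(taps) / len(taps)
    return [h - mean for h in taps]  # remove any residual response at 0 Hz


def gain(taps, mod_freq_hz, frame_rate_hz=100.0):
    """Magnitude response of the FIR filter at one modulation frequency."""
    half = len(taps) // 2
    re = im = 0.0
    for k, h in enumerate(taps):
        phase = 2.0 * math.pi * mod_freq_hz * (k - half) / frame_rate_hz
        re += h * math.cos(phase)
        im -= h * math.sin(phase)
    return math.hypot(re, im)


taps = mexican_hat_taps()
for f in (0.0, 0.5, 4.0, 30.0):
    print(f"{f:5.1f} Hz -> gain {gain(taps, f):.3f}")
```

With these assumed widths the response vanishes at 0 Hz, stays small at 0.5 Hz and 30 Hz, and is large around 4 Hz, i.e. the filter passes modulation frequencies inside the 1-15 Hz range and suppresses slower and faster changes.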



2.3 Discussion 

The Bark-like spectral resolution is well accepted in ASR. Dominant ASR feature 
extraction techniques are based on a frequency-warped short-term spectrum of speech 
in which the spectral resolution is better at low frequencies than at high ones. Use of 
such a spectrum is motivated by properties of human auditory perception. What makes 
LDA-derived spectral discriminants interesting is the fact that no knowledge of the 
spectral resolution of the human auditory system was implied in their design: the 
starting representation was a Fourier transform derived short-term spectrum of speech 
with equal frequency resolution at all frequencies. The task was to derive an optimized 
projection for classification of real speech into phoneme classes. Though not shown 
here, principal component analysis (which simply projects on directions of maximum 
variability of the data) yields discriminants with equal resolution at all frequencies. 
Thus, it is the classification task implied in LDA that is behind the human-like 
frequency resolution of the spectral discriminants. The result suggests that the signal 
should be processed with a human-like spectral analyzer. We find this result remarkable 
and we believe that it supports optimality of the speech code with respect to 
properties of human hearing. 

[Figure 2: two panels, Discriminant Vector 1 and Discriminant Vector 2.] 

Fig. 2. Spectral basis derived from LDA analysis of short-term Fourier spectra of about 
2 hours of phoneme-labeled telephone speech from multiple speakers. 

Conventional ASR typically does not use more than 100 ms of the signal for the 
initial classification. However, it is known and well accepted that because of the inertia 
of speech production organs, the effect of a phoneme spreads over a considerable length 
of time (coarticulation). The result of LDA on temporal vectors suggests the need for 
collecting all the available evidence about the occurrence of any given phoneme in the 
stream of speech data. 

Temporal LDA-derived discriminants span time intervals of the order of several 
hundred milliseconds and emphasize modulation frequencies within the 1-15 Hz range. 
Many properties of human hearing (some of which are reviewed in [8]) imply time 
constants of the order of several hundred milliseconds. Also, sensitivity of human hearing 
to modulations is consistent with frequency responses of the discriminants. Are we again, 
just as in the case of the LDA derived spectral basis functions, looking at some supporting 
evidence for optimality of speech code with respect to human hearing? 

Fig. 3. FIR RASTA filters derived from LDA analysis of short-term critical-band spectra 
of about 2 hours of phoneme-labeled telephone speech from multiple speakers. 



3 Classification from Spectral Patterns (TRAP) 

3.1 Introduction to the Problem 

Some speech sounds can be emulated by emulating their short-term spectral envelopes. 
Consequently, the decomposition of the speech signal into source and spectral envelope 
forms the basis of many speech coding techniques. Since ASR evolved from speech coding, 
most current ASR devices use stochastic pattern matching of features that are 
derived from short-term spectral envelopes of speech sounds. This means that the phonetic 
quality of an incoming speech segment is estimated from the shape of the spectral envelope. 

Short-term spectral envelopes of speech are easily corrupted by relatively minor 
factors such as the frequency response of communication equipment or frequency-localized 
noise. In a search for alternative features for ASR we have developed techniques 
for normalization of spectral envelopes based on temporal filtering of spectral envelopes 
[6], which demonstrated that some types of slowly-varying or fast-varying noises can 
be more easily handled in the temporal domain. 

Spectral envelopes reflect configurations of acoustic resonators formed by the varying 
shape of the vocal tract in production of speech and are accepted as primary carriers of 
linguistic information in speech. However, experiments with perception of speech-like 
sounds carried out almost half a century ago already indicated some problems with the 
static spectral envelope as a carrier of linguistic information in speech. In the early 
fifties Cooper et al. [3] studied human recognition of energy bursts followed by stylized 
vowel-like sounds. Such stimuli are typically associated with stop-vowel CV syllables. 
The authors observed a peculiar result: a continuous change in frequency of the burst 
could yield discontinuous classification! When the burst precedes front vowels (which 
were emulated in Cooper’s experiments by two spectral peaks), and as the frequency of the 
burst increases, the percept changes from /p/ into /k/, then back into /p/ and again into 
/k/ before it finally is perceived as /t/. On the other hand, when the burst precedes back 
vowels (emulated by a single spectral peak), the percept goes from /p/ into /k/ and back 
into /p/ again and finally into /t/. Thus, it appears that the main cue for the percept of 
/k/ is that the frequency of the stop burst is close to a major concentration of spectral 
energy in the following vowel, independently of the actual frequency at which it occurs. 

While studying the nature of critical bands in hearing we have realized one possible 
mechanism that could allow for the results of the experiments of Cooper et al. Such a 
mechanism would be based on Fletcher’s findings [4] from his masking experiments, which 
demonstrate that uncorrelated noise (signal) outside the critical band has only a 
negligible effect on detection of the signal (another signal) within the critical band. 
Thus, human auditory perception may be, at least in principle, capable of independent 
processing within the individual critical bands. Fletcher further proposes that errors in 
human recognition of nonsense syllables within relatively narrow articulatory spectral 
bands (each articulatory band spanning about 2 critical bands) are independent. Therefore, 
under Fletcher’s paradigm, human cognition could use temporal evidence in either 
upper or lower frequencies for recognizing /p/ from /k/. 

Fletcher’s results as presented by Allen [1] led us [7] and others [2] to propose 
a method of addressing frequency-localized noises in the multi-band ASR paradigm. 

Gradually we realized that the concept of the spectral envelope as a prime carrier of 
linguistic information could be challenged. Spectral envelopes are relatively fragile in 
the presence of noise. Why would forces of nature choose such a fragile carrier of 
linguistic information? The relative success of multi-band ASR then led to the 
proposition that deriving the spectral shape for the classification may not be the prime 
goal of frequency-selective hearing, but that this selectivity is rather used to choose 
high signal-to-noise parts of the signal spectrum for the subsequent temporal 
classification [8]. 

3.2 ASR Using Temporal Patterns of Spectral Energies 
in Individual Critical Bands 

As reviewed above, critical band masking experiments suggest a possibility of some 
independent cognitive processing of the information within each individual critical band. 
Subsequently, we proposed a specTRAl Pattern (TRAP) based classification that uses 
relatively long (about 1 s) patterns of critical band energy in each individual critical 
band as inputs to individual estimators of likelihoods of phoneme classes. Since the 
patterns are quite long, each pattern spans a number of phonemes (roughly 15 on average) 
but it is used to classify only the phoneme in its center. To obtain final estimates of 
phoneme class probabilities, another classifier then combines likelihood estimates from 
the individual critical bands. These final estimates are used in a search for the best 
matching speech utterance. Since the individual temporal patterns are relatively long, 
it is possible to remove the mean and normalize the variance within each pattern. This 
is done to gain some more robustness in the presence of noise. 
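
The input to one per-band estimator can be sketched as follows. The details are assumptions made for illustration: a 10 ms frame step, so that 101 frames cover about 1 s, a plain list of energies for a single critical band, and the function name `trap_vector`. The classifiers themselves are not shown.

```python
import math
import statistics


def trap_vector(band_energies, center, half_width=50):
    """Mean- and variance-normalized temporal pattern around one frame.

    band_energies: energy trajectory of a single critical band, one value
    per frame. center: index of the frame whose phoneme is to be classified.
    half_width: frames taken on each side (50 frames at an assumed 10 ms
    step give a pattern of about 1 s).
    """
    lo, hi = center - half_width, center + half_width + 1
    if lo < 0 or hi > len(band_energies):
        raise ValueError("not enough context around the center frame")
    window = band_energies[lo:hi]
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    if std == 0.0:
        return [0.0] * len(window)  # flat trajectory carries no pattern
    return [(v - mean) / std for v in window]


# Synthetic band-energy trajectory (3 s at the assumed frame rate).
energies = [5.0 + math.sin(0.1 * n) for n in range(300)]
pattern = trap_vector(energies, center=150)
print(len(pattern), round(statistics.fmean(pattern), 6), round(statistics.pstdev(pattern), 6))
```

Because the mean is removed and the variance normalized within each pattern, a constant offset or gain applied to the band energy leaves the pattern unchanged, which is one way to read the robustness argument made above.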

The TRAP technique is an extreme case of the multi-band ASR [7,2] where the 
sub-bands are critical bands and the temporal context is rather large. 

The TRAP-based recognizer yields results that are comparable to results from 
conventional recognizers [9,10,13]. However, error patterns from the TRAP-based 
recognizer are often different. While the conventional (i.e. spectral envelope-based) 
recognizer yields better results for sonorants, the TRAP-based recognizer tends to 
outperform the conventional one in classification of obstruents [9]. This can be used to 
advantage when combining information from these two different techniques. More details 
can be found in [13]. 

[Figure 4: two panels. Conventional spectral envelope-based ASR: a 10-60 ms slice of 
the time-spectral pattern is mapped to a phoneme. Temporal pattern (TRAP) based ASR: 
a 1000 ms slice is passed through a classifier and mapped to a phoneme.] 

Fig. 4. Fundamental differences between conventional spectral envelope based ASR, which uses 
feature vectors derived from short-term spectral envelopes of speech (i.e. from vectors obtained 
by slicing the time-spectral pattern of speech along its frequency axis) and the TRAP based ASR, 
which uses temporal feature vectors obtained by slicing the time-spectral pattern along its time axis. 

It is instructive to see how the individual critical-band classifiers estimate likelihoods 
of classes based purely on temporal information within a critical band. Figure 6 (adopted 
from [13]) shows outputs from all 15 individual critical-band classifiers for the input 
signal corresponding to a front vowel /iy/. The 5 Bark critical-band output is enlarged 
in the lower part of the figure. As seen, the classifiers which act on the critical bands with 
high signal-to-noise ratio (SNR) (i.e. in the region of the first formant as well as at high 
frequencies which are dominated by a cluster of higher formants for the front vowel 
/iy/) correctly indicate the speech sound as coming from a sonorant (i.e. 
outputs for sonorant classes are high). In the region in between the formants (i.e. the 
7-11 Bark critical bands), the classifiers yield high-entropy indecisive outputs. 
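
To make the notion of a "high-entropy indecisive output" concrete, the following small sketch computes the Shannon entropy of a vector of class probabilities. The two example vectors are hypothetical and are not taken from Figure 6.

```python
import math

def entropy(posteriors):
    """Shannon entropy (in bits) of a vector of class probabilities."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)

# Hypothetical classifier outputs over four classes (not measured data).
confident = [0.90, 0.05, 0.03, 0.02]   # as might come from a high-SNR band
indecisive = [0.25, 0.25, 0.25, 0.25]  # as might come from a low-SNR band

print(entropy(confident) < entropy(indecisive))  # True
```

A uniform output over four classes reaches the maximum of 2 bits, whereas an output dominated by one class has a much lower entropy.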

The TRAP technique is principally different from the spectral envelope matching 
technique applied in the conventional ASR. Instead of searching for the best match of the 
envelope, TRAP is searching for the best match of estimates of likelihoods of sub-word 
classes in the individual critical bands. Within each critical band, the match is derived 
based only on the temporal pattern of the spectral energy within this band. Within a 
phoneme class (e.g. within the class of vowels), these patterns appear to be quite similar 
in all critical bands that contain a high SNR (i.e. high quality) signal. In such high 
SNR bands, the estimated likelihoods are high for the correct class phoneme models and 
low for the incorrect ones. In low SNR bands, the patterns are indistinct, yielding low 
likelihoods for all classes. 

Fig. 5. Configuration of a complete TRAP based classifier. The individual critical band classifiers 
classify temporal feature vectors representing temporal evolution of critical band energies within 
each critical band. Outputs of all critical band classifiers form an input to a final merging classifier, 
which delivers estimates of probabilities of the phoneme classes. Typically, all classifiers are 
nonlinear (i.e. neural net based) classifiers. 

3.3 Discussion 

We believe that the TRAP technique offers a practical and interesting engineering al- 
ternative to the conventional spectral envelope based ASR. However, its viability in 
practical ASR also opens space for some more general speculations about the nature of 
the human speech communication process. 

Let us start with speculations of how to develop means for communication by sound 
in a world full of noises. It would make sense if the communication sounds were broad- 
band so that the energy of the signal would be distributed in frequency and there could be 
a better chance that at least parts of the message would remain uncorrupted. A frequency- 
selective hearing could then separate the incoming sounds into different channels, some 
of which would at times contain relatively clean signal while the others would be 
contaminated by noise. 

Having a frequency-selective hearing covering a broad range of frequencies may 
allow for coding the information in spectral profiles of speech sounds regardless of 
a temporal structure of the spectral profiles within the sound classes (a frequency-based 
code). This view of speech code appears to be implicitly adopted by most ASR systems 
where the spectral envelope shape is the dominant feature to be classified and the tem- 
poral structure of speech within classes is only crudely modeled and is relatively freely 
modified by time-warping techniques. One obvious argument for such a frequency-based 
code (besides the existence of the frequency selectivity of hearing) is the fact that the human 
speech production system resonates at frequencies that change with changing shape of the 
vocal tract. Subsequently, at any given moment, some frequency components in speech 
are more intense than others. It is well accepted that many speech sounds (sonorants) 
are characterized by frequencies of these dominant spectral components (formants). 

Fig. 6. The situation in the TRAP based classifier for the input signal representing the vowel /iy/. 
The upper part of the figure shows outputs of all critical band classifiers (numbered 1 - 15). The 
5th band classifier output is shown enlarged in the lower part of the figure. The highest valued 
outputs are marked by black dots. 



However, when coding the information by a frequency-based code, the information 
could be easily disrupted by a frequency-localized noise. An alternative role of the 
frequency-selective hearing could be to separate components of the acoustic signal in 
order to be able to choose at any moment the ones with reliable (i.e. not noisy) information 
content for the classification. Such a classification would not be based on an instantaneous 
spectrum of the sound but rather would be based on a temporal pattern of the frequency- 
localized energy. 

If the information was coded simultaneously in a number of parallel frequency chan- 
nels in temporal patterns of spectral energy in these individual channels (a temporal 
code), there would be a better chance that the information could survive at least in some 
channels even when the signal would be disrupted by frequency-localized noises. This 
view of speech code is more in line with multi-band ASR schemes, the extreme case 
of which is the TRAP-based ASR. Results of experiments in perception of narrowband 
speech [17, 5, 14] also support this notion. 

From the view of temporal code, sonorant sounds serve as broad-frequency carriers 
of dynamic information. This dynamic information is imposed by movements of vocal 
organs and is evaluated individually in different frequency bands of hearing. The reason 
for changing frequencies of concentration of spectral energy (formants) in time is that 
such changes provide (together with closures of the vocal tract and changes in the 
vocal tract excitation) means for creating temporal patterns in individual frequency 
bands. Shifting frequencies of concentration of spectral energy in time also gives a better 
chance of preserving at least part of the information in presence of frequency-selective 
noises. 

This position is extreme. Similar arguments could be made for a frequency-based 
code, which would be robust in presence of time-localized noises. Most likely, speech 
communication code evolved to be robust in presence of both the frequency localized 
and time localized noises. Subsequently, both the frequency domain and the temporal 
domain are likely to be needed to decode the speech code. 

4 Conclusions 

The present work shows that an ASR system optimized on purely engineering grounds 
for the best performance may as a result of the optimization acquire human-like speech 
perception properties. Additionally, the paper presents some arguments against purely 
spectral envelope-based coding of linguistic information in speech. 


Abstract. Speech technology is one of the very first technologies that seems 
to offer real user-friendly access to human-like systems. A wide field of 
applications should become possible. But reality is still quite different. Automatic speech 
recognition is still far from human capabilities and even speech synthesis does 
not yet offer human-like quality. So the reality of applications is still at a rather 
basic stage. Background noise, room echoes, changing frequency characteristics of 
electric and acoustic transmission, and speaker and articulation variations are only a few 
aspects of the complexity of speech recognition. Only through new and advantageous 
methods for cancellation of background noise and room echoes, equalization of 
transmission channels and adaptation to varying speakers and articulation have new 
applications become attractive. Such interesting applications are in the car, in 
consumer applications, in advanced telephone voice interaction and in many other 
areas where speech is an interesting medium for user-friendly man-machine 
interaction. The dialog aspect will play a growing role, and therefore, especially in 
noisy environments, further enhancements of speech output will play an important 
role too. We have to ensure high intelligibility in such environments to make really 
good and efficient applications. 



1 Introduction 

Contrary to many other machine applications, man-machine interaction by speech has 
to provide a rather human-like interaction. This means that most applications are only 
accepted when the dialog impression is similar to our true human interaction. This means 
that recognition and understanding of speech should be nearly perfect. Recognition 
errors and misunderstandings surely happen during human dialogs, but these are often 
checked and corrected in the natural dialog, which is based on our general knowledge 
and all the situational factors. 

1.1 Relevant Application Factors 

The most important factors responsible for a successful man-machine interaction are 
factors which are ultimately responsible for a human-like interaction: 

• Recognition and understanding should have low error rates, and in all cases where 
errors happen these should be intuitively understood and easily corrected. 

• Speech output should have an extremely high quality and it should be as human-like 
as possible. 

• Dialogs should be as intuitive as possible. This usually means that people should be 
able to speak as they are used to speaking and working within a certain task. This normally 
means that it should be possible to speak in a more or less spontaneous way and 
with no specific restrictions concerning syntax and vocabulary. 

• One should not need any specific microphone, especially no head-set microphone. 
The microphone should be anywhere in the environment where speech input is used. 

We all know that most of these factors and requirements are only partially fulfilled and 
that reality is still far away from having a perfect speech input system. But on the other 
hand we know that speech technology has made substantial progress in the last years 
and that for some important applications the technology breakthrough has already been 
realized. 

1.2 Application Factors for Telephone Speech Input 

An example where speech input is already rather successful is telephone-based speech 
input systems, where some decisive technology problems have been solved in 
a good manner and where at the same time systems have been designed which are a good 
compromise between user requirements and system capabilities. 




Fig. 1. Relevant aspects for applicability of speech recognition in telephone speech systems. 



All the aspects shown in Figure 1 are more or less important for applications. The 
main aspect is always robustness against many varying parameters, some of which are 
also described in the figure, like resistance against different sorts of noise distortions or 
speaker variations. Speaker independence of recognition in particular has made excellent 
progress in the last few years, mainly through the availability of large databases with a 
huge variety of different voices and dialectal factors. Of course the whole area of dialectal 
variations produced through foreign languages is still an important field in a globalizing 
society. 

This example may give a small impression of the problems still to be solved 
for a realistic solution in an area where highly accepted applications have already been 
realized, and it shows that even with restricted technology, limited but well accepted 
applications are possible. 



2 Some Selected Applications 

A few examples of applications may show where essential problems have been 
solved and where major problems still have to be solved in the future. 



2.1 Systems for Application in Cars 

In the last few years applications in cars have become some of the dominating applications 
of speech systems. There are two main reasons for this. The multitude of upcoming electronic 
information services requires specific safe operation technology, and the technological 
enhancements of speech technology, including e.g. speaker independence, have 
opened attractive solutions. Most information in a car has only a short lifetime. Many 
applications are concerned with the control of systems, like entertainment, telephone and 
navigation systems. In the future additional applications will become relevant which 
involve much more information exchange, like automatic understanding of e-mails and 
possibly formulating an answer. 

The distraction problem will then become a dominating challenge. Of course all 
the interactions we carry out in the car have some distraction factor, even discussions 
with passengers or active children in the background. But the intelligent car may further 
raise this to a level which may no longer be acceptable. The only solution is an excellent 
design of interaction methods, like good dialogs and adapted information presentation. 
The driver should not get distracting information when he is in a critical situation. This 
means that all the telematic systems operated by speech and other means should get 
information about the current traffic situation and use this information in an adequate 
manner. 

2.2 Public Information Systems 

The telephone is the most widely used communication basis for human dialogs and it 
is therefore quite clear that it is an equally important medium for man-machine dialogs. 
This interaction channel has many advantages compared with the hands-free aspects 
which are relevant in car applications. As already shown in Figure 1, the 
telephone has other important technology aspects to be solved. Of course dialog aspects 
play a major role, and all the problems related to interaction with occasional 
and untrained users become rather dominant. There is usually no chance to train the 
users. Therefore the dialog has to be self-explanatory or it will not be usable successfully. 

2.3 Consumer Appliances 

To realize voice control for consumer appliances, including voice activated dictation, is 
a very old dream of technicians. The latter in particular was already proposed in the early 
years of the 20th century. But it has been seen that consumer environments like 
households and offices are often not well suited for applications, and the fact 
that such applications often require hands-free voice control can cause major problems. 
In spite of these restrictions there are by now some convincing applications 
where even the hands-free problem has been solved in a nearly satisfying manner. 

3 Dialog Aspects 

For the human user, his present knowledge and pragmatics are the relevant aspects in the 
specific dialog situation. In many applications the current situation plays an additional 
important role, e.g. in a car the current traffic situation may have an important influence on 
the voice behaviour of the speaker. A system must therefore be able to adapt to different 
situations, it must react to changing wishes, and it should at the same time provide 
a rather simple and easily understandable interaction. All these problems are 
not yet solved. Machine dialogs are often tied too rigidly to a certain situation. This 
may be sufficient in a case where the system understands perfectly, but it causes 
problems when there are recognition errors. 

4 Speech Output 

The relevance of speech output in the voice control dialog is often underestimated. But it 
is surely as important as the speech input. Well-designed speech output in dialogs 
and high quality speech synthesis will be a necessary requirement for most applications. 
In a car, for example, the output should be rather short and precise, while in an aid for elderly 
people or in certain telephone dialogs it may be acceptable to have rather detailed output 
which can repeat some relevant aspects to make sure that people really understand 
the content without major problems. 


Abstract. A so-called generalised phoneme recognition problem for a two-level 
speech understanding system is solved. This means that, under free phoneme 
order, the $N \ge 1$ best phoneme sequence recognition responses are found. 
The method is based on a constructive description of diverse realisations of a speech 
signal. A stochastic generative automata grammar, designed to synthesise 
the speech signal prototypes, serves this purpose. This grammar composes all possible 
speech signal prototypes, allowing for a non-linear rate of pronunciation in 
general, and of individual phonemes in particular, as well as for 
co-articulation and reduction of sounds and non-linear variation of the speech 
signal intensity along the time axis. To deepen the earlier research, 
phoneme-threephone (PT) signal prototypes are introduced. The rule for joining 
PT signal prototypes into sequences is evident: the output and input phonemes of 
joined PTs have to coincide. The problem is solved using a new computational 
scheme of dynamic programming based on the concepts of potentially optimal index and 
phoneme response, which substantially reduce both memory and calculation requirements. 



1 Introduction 

The following approach remains popular in automatic speech recognition and understanding. 
It assumes that continuous speech must first be recognised as a phoneme sequence, 
and that this phoneme sequence must then be recognised and understood as a word sequence 
and as the meaning transmitted by the speech signal. 

Although this approach seems to be erroneous, since the best method of finding 
the transmitted phonemes is both to recognise and to understand the speech signal, 
it is preferred because it simplifies the distribution of research work between 
specialists in acoustics, phonetics, linguistics and informatics. 

To improve this approach, it was proposed to introduce significant decisions in 
phoneme recognition procedures [1]. The next step consists in improving the 
generative automata grammars used, for example by replacing the phoneme-diphone speech 
model with a phoneme-threephone one. 

In this paper a so-called generalised phoneme-threephone recognition 
problem is proposed for the two-level speech understanding system. The structure of this system 
is shown in Figure 1. A generalised phoneme recognition problem means that, under 
free phoneme order, the $N \ge 1$ best phoneme sequence recognition 
responses are found. A Speech Interpreter then analyses these phoneme sequences through a Natural 
Language Knowledge filter. 

2 Phoneme-by-Phoneme Recognition in Continuous Speech. 
General Idea 

The general idea is to construct, taking into account only the inertial properties of the 
articulation apparatus and the phonetics of the language, a PT generative automata grammar which can 
synthesise all possible continuous speech model signals (prototypes) for any phoneme 
sequence. This grammar has to reflect such phenomena of speech signal variety as 
non-linear change of both the rate and the intensity of pronunciation, sound co-articulation and 
reduction, sound duration statistics, phonemeness, and so on. The phoneme-by-phoneme 
recognition of an unknown continuous speech signal then consists in synthesising the 
most likely speech model signal and determining the phoneme structure of the 
latter. 

Fig. 1. Two-Level Speech Understanding System Structure 



The problem of directed synthesis, sorting out and formation of a phoneme sequence 
recognition response is solved by using a new computational scheme of dynamic 
programming, in which (for a substantial reduction in memory and calculation 
requirements) the concepts of potentially optimal index and potentially optimal phoneme are used [1]. 

At first, the phoneme-by-phoneme continuous speech recognition problem will be 
considered. Then this statement will be generalised for the $N \ge 1$ best phoneme sequences. 

3 General Allowable Phoneme-Threephone Sequences 
Generative Grammar 

The generative grammar for free phoneme sequences mentioned above will be given under 
the phoneme-threephone (PT) interpretation. 

Let a finite set $K$ of phonemes $k \in K$ be given. The phoneme alphabet includes 
the phoneme-pause #. For example, in the Ukrainian phoneme alphabet $K$ we 
distinguish stressed and non-stressed vowels, hard and soft consonants, stationary 
phonemes $k \in \{$A, O, U, E, I, Y [all stressed and non-stressed], V, V’ [the symbol 
’ denotes softness], H, H’, ZH, ZH’, Z, Z’, J, L, L’, M, M’, N, N’, R, R’, F, F’, KH, 
KH’, SH, SH’, #$\} \subset K$, which change their duration, and transitive phonemes 
$k \in \{$B, B’, G, G’, D, D’, K, K’, P, P’, T, T’$\} \subset K$. 

Generally speaking, natural language allows $|K|^3$ PTs in total, but hereafter only 
about 2,000 to 3,000 basic PTs $t \in T$, which approximate all possible 
PTs, are considered. Recall that each PT $t$ from the PT alphabet $T$ has, besides the name $t$, also the 
triple name $t = uWv$, where $u, W, v \in K$ and $u$, $v$ are the input and output phoneme names, or 
non-terminal symbols, for PT $t$, respectively. So the PT $t = uWv$ is the phoneme $W$ 
considered under the influence of the neighbouring phonemes $u$ and $v$: $u$ 
precedes $W$ and $v$ follows $W$. 

Obviously, only PTs $t_1 = uWv$ and $t_2 = wVz$ with $Wv = wV$ are allowable for connection. 
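
The connection rule can be stated in a few lines of code. The sketch below is illustrative only; the names `PT` and `can_follow` are not taken from the paper.

```python
from typing import NamedTuple

class PT(NamedTuple):
    """A phoneme-threephone t = uWv."""
    u: str  # preceding (input) phoneme
    W: str  # central phoneme
    v: str  # following (output) phoneme

def can_follow(t1: PT, t2: PT) -> bool:
    """True if t2 may follow t1: the output pair Wv of t1 equals the input pair wV of t2."""
    return (t1.W, t1.v) == (t2.u, t2.W)

print(can_follow(PT("#", "A", "B"), PT("A", "B", "C")))  # True
print(can_follow(PT("#", "A", "B"), PT("B", "A", "C")))  # False
```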

From now on we will assume that, besides the phoneme and PT alphabets, the 
following knowledge is given: 

A. A finite set $E$ of elementary speech signal prototypes, or typical one-quasiperiod 
segments, $e(j) \in E$, where $j \in J$ is the name of $e(j)$ in the name alphabet $J$. E.g. there 
are $|J| = |E|$ = 2^® elements in $E$ and $J$. So the set $J$ makes up the microphoneme 
level of speech patterns, and the pair $(J, E)$ is the code book for one-quasiperiods. 

B. A finite set $T$ of basic PTs $t \in T$. The PT $t$ is specified by its acoustical transcription 
in the alphabet $J$: $t = (j_{t1}, j_{t2}, \ldots, j_{ts}, \ldots, j_{tq(t)})$, where $s$ indicates the ordinal place 
in the transcription and $q(t)$ is the transcription duration for $t$. 

C. Distributions $P(x/j)$ of observed elements (quasiperiods) $x$ for all $j \in J$, in particular 
$P(x/j) = P(x/e(j))$. 

The knowledge mentioned in A, B and C is found in training mode [1]. For each speaker 
it forms a so-called Speaker Voice File (Passport). 

After preprocessing, a speech signal to be recognised is presented by the sequence 
$X_{0l}$ of observed one-quasiperiod segments, or elements, $x_i$: 

$$X_{0l} = (x_1, x_2, \ldots, x_i, \ldots, x_l),$$ 

where $l$ is the quantity of observed quasiperiods. The segment 

$$X_{mn} = (x_{m+1}, x_{m+2}, \ldots, x_n), \quad 0 \le m < n \le l,$$ 

is considered as a signal realisation of the PT $t$ with a probability which is calculated 
as the convolution over the microphoneme bounds $\{\tau_s\}$: 

$$P(X_{mn}/t) = \max_{\{\tau_s\}} \prod_{s=1}^{q(t)} \prod_{i=\tau_{s-1}+1}^{\tau_s} P(x_i/j_{ts}), \tag{1}$$ 

where $\tau_0 = m$, $\tau_{s-1} < \tau_s$, $\tau_{q(t)} = n$. The respective stochastic generative automata 
grammar (graph), which both generates the PT model signals and compares the signal 
segment $X_{mn}$ with all generated ones according to (1), is shown in Figure 2a. That 
graph has $q(t)$ states. To each state $s$ the microphoneme $j(s) = j_{ts}$ 
with the distribution $P(x/j_{ts})$ is ascribed. The transitions between states follow the 
arrows and take 0 or 1 discrete time steps. Omitting microphonemes is forbidden 
here. The grammar shown in Figure 2b forbids omitting more than two microphonemes 
in a row. Schematic notes for the PT graph $t = uWv$, $u, W, v \in K$, are given in Figure 2c, where 
only the input $s = u$ and the output $s = v$ states are distinguished. 
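
The maximisation in (1) for the grammar without omissions (Figure 2a) can be computed by a simple dynamic programming recursion. The sketch below is an illustration under stated assumptions, not the author's implementation: it works with log-probabilities to avoid numerical underflow, and the function name and the table layout `logp[s][i]` are hypothetical.

```python
NEG_INF = float("-inf")

def pt_log_likelihood(logp):
    """Log of expression (1) for one PT and one signal segment.

    logp[s][i] is log P(x_i / j_ts) for state s (0-based) and the i-th element
    (0-based) of the segment. Every state must account for at least one element.
    """
    q, n = len(logp), len(logp[0])
    best = [NEG_INF] * q
    best[0] = logp[0][0]              # the first element belongs to the first state
    for i in range(1, n):
        new = [NEG_INF] * q
        for s in range(q):
            stay = best[s]                                 # same microphoneme continues
            advance = best[s - 1] if s > 0 else NEG_INF    # next microphoneme starts
            prev = max(stay, advance)
            if prev > NEG_INF:
                new[s] = prev + logp[s][i]
        best = new
    return best[q - 1]                # the last element must belong to the last state
```

For a two-state PT and a three-element segment with `logp = [[-1, -2, -5], [-4, -1, -1]]`, the two admissible bounds give -3 and -4, and the function returns -3.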

Fig. 2. Generative grammars (graphs) for the phoneme-threephone: a) no microelement omission; 
b) no omission of two microelements in a row; c) schematic notes of the PT graph t = uWv. 



Let us unite all PT graphs into a common one. It is allowable to connect PTs $t_1 = uWv$ 
and $t_2 = wVz$ into a phoneme sequence if the output pair name $Wv$ of the preceding PT $t_1$ 
coincides with the input phoneme pair name $wV$ of the following PT $t_2$. 

In this way we obtain a common phoneme graph (CPG) for continuous 
speech signal generation. The full CPG for the three-phoneme alphabet $K = \{$A, B, C$\}$ is 
shown in Figure 3. 

To each phoneme $W \in K$ there corresponds a block of all $|K|^2$ PTs $uWv$, where 
$u, v \in K$. In Figure 3 the blocks are denoted by dotted lines. Each block of phoneme $W$ 
has $|K|$ input buses $uW$, $u \in K$, and $|K|$ output buses $WZ$, $Z \in K$. Output buses $WZ$ are 
connected with input buses under the same name. 
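
The block structure of the CPG can be enumerated directly. The following sketch is only an illustration; `build_cpg` and the dictionary layout are hypothetical names chosen here.

```python
from itertools import product

def build_cpg(K):
    """Group all PTs uWv into blocks by central phoneme W and list the bus connections."""
    blocks = {W: [(u, W, v) for u, v in product(K, repeat=2)] for W in K}
    # The output bus WZ of block W is the input bus of the same name in block Z.
    links = {(W, Z): Z for W in K for Z in K}
    return blocks, links

blocks, links = build_cpg(["A", "B", "C"])
print(len(blocks["A"]))                       # 9 PTs in each block, i.e. |K|^2
print(sum(len(b) for b in blocks.values()))   # 27 PTs in total, i.e. |K|^3
print(links[("A", "B")])                      # bus AB leads into block B
```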

Additionally, for each PT $t = uWv$, $u, W, v \in K$, the input state $s = u$, the output state 
$s = v$ and the internal states $s$ are distinguished. One of the phonemes is 
the phoneme-pause #. This means that the block of phoneme # has an input bus ## at which 
generation starts. Let us also introduce an overall enumeration of states within each PT on the 
CPG, in accordance with the permissible movement along the arrows. 

On the CPG, the best phoneme sequence recognition response or, what is the 
same, the best permissible PT sequence recognition response is defined by maximisation 
of the expression (2): 

$$P(X_{0l}/(t_1, \ldots, t_s, \ldots, t_Q)) = \max_{\{\tau_s\}} \prod_{s=1}^{Q} P(X_{\tau_{s-1}\tau_s}/t_s), \tag{2}$$ 

where $\{\tau_s\}$ are the bounds between phoneme-threephones in $X_{0l}$. 



4 Phoneme Sequence Recognition Algorithm 



Let $U_i(s)$ denote the set of continuous speech prototypes of duration $i$ which 
are generated by the CPG as a result of movement from the input bus $s = \#\#$ to any state 
$s$ within $i$ time steps. Let $F_i(s)$ denote the best probability (2) which is reached 
on the set $U_i(s)$ for the initial speech segment $X_{0i} = (x_1, \ldots, x_i)$, and $n_i(s)$ 
the potentially optimal beginning of the last PT $t_i(s)$ in the best PT sequence for $X_{0i}$. 

Let $F_r(s)$, $n_r(s)$, $t_r(s)$ have been calculated for all states $s$ and for all time steps $r < i$ 
which precede $i$. Then, after the next observed element $x_i$ appears, new values 
$F_i(s)$, $n_i(s)$, $t_i(s)$ are calculated simultaneously (in parallel) for all states $s$ in the 
order a), b), c): 

a) for all internal PT states $s \in t = uWv$, except the first states of the PTs, and for all $t$ 
(see Figure 2a and Figure 3): 

$$F_i(s) = \max\{F_{i-1}(s-1), F_{i-1}(s)\} \cdot P(x_i/j(s)),$$ 

$$n_i(s) = \begin{cases} n_{i-1}(s-1), & \text{if } F_{i-1}(s-1) > F_{i-1}(s); \\ n_{i-1}(s), & \text{if } F_{i-1}(s-1) \le F_{i-1}(s); \end{cases}$$ 

b) for all first internal states $s = s_1(t) \in t = uWv$ and for all $t \in T$ (see Figure 2a): 

$$F_i(s_1(t)) = \max\{F_{i-1}(u(t)), F_{i-1}(s_1(t))\} \cdot P(x_i/j(s_1(t))),$$ 

$$n_i(s_1(t)) = \begin{cases} i-1, & \text{if } F_{i-1}(u(t)) > F_{i-1}(s_1(t)); \\ n_{i-1}(s_1(t)), & \text{if } F_{i-1}(u(t)) \le F_{i-1}(s_1(t)); \end{cases}$$ 

c) for all output buses $\sigma = Wv$ and all PTs $t \in T(\sigma = Wv) = \{t = uWv : W(t)v(t) = Wv\}$ 
with the same $W(t)v(t) = Wv$: 

$$F_i(\sigma) = \max_{t = uWv:\, Wv = \sigma} F_i(s_f(t)),$$ 

$$t_i(\sigma) = u_i(\sigma)Wv = \arg\max_{t = uWv:\, Wv = \sigma} F_i(s_f(t)),$$ 

$$n_i(\sigma) = n_i(s_f(t_i(\sigma))) = n_i(s_f(u_i(\sigma)Wv)),$$ 

where $s_f(t)$ is the final state of PT $t$ and $\beta_i(\sigma) = u_i(\sigma)W$ is the potentially optimal 
input bus, i.e. the previous potentially optimal output bus, for the potentially 
optimal PT $t_i(\sigma)$, which begins at the time $n_i(\sigma)$ and ends at the time $i$. 
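
One time step of the rules a) to c) can be sketched for a single PT and a single output bus as follows. This is a minimal illustration, assuming plain probabilities and list-based tables; the function names and argument layout are hypothetical and do not come from the paper.

```python
def update_pt_states(F_prev, n_prev, F_bus_prev, emit, i):
    """Rules a) and b) for the states of one PT at time step i.

    F_prev[s], n_prev[s]: values at time i-1 for states s = 0..q-1 (state 0 is s_1(t)).
    F_bus_prev: value F_{i-1}(u(t)) on the input bus of this PT.
    emit[s]: P(x_i / j(s)) for the current element x_i.
    """
    q = len(F_prev)
    F_new, n_new = [0.0] * q, [0] * q
    # b) first state: enter from the input bus (the PT then begins at i-1) or stay
    if F_bus_prev > F_prev[0]:
        F_new[0], n_new[0] = F_bus_prev * emit[0], i - 1
    else:
        F_new[0], n_new[0] = F_prev[0] * emit[0], n_prev[0]
    # a) other states: advance from the preceding state or stay
    for s in range(1, q):
        if F_prev[s - 1] > F_prev[s]:
            F_new[s], n_new[s] = F_prev[s - 1] * emit[s], n_prev[s - 1]
        else:
            F_new[s], n_new[s] = F_prev[s] * emit[s], n_prev[s]
    return F_new, n_new

def best_on_output_bus(finals):
    """Rule c): finals maps each PT (u, W, v) ending on one bus to (F_i(s_f), n_i(s_f))."""
    t_best = max(finals, key=lambda t: finals[t][0])
    F, n = finals[t_best]
    return F, n, t_best
```

In a full recogniser these two functions would be applied at every time step to all PTs and all output buses of the CPG.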

To form the phoneme sequence recognition response it is sufficient to keep in 
memory the array of triples $(F_i(\sigma), n_i(\sigma), t_i(\sigma))$, $\sigma = wV$, $w, V \in K$, $i = 1 : l$. 

Since a continuous speech signal $X_{0l}$ begins and ends with the PT $t = \#\#\#$ with output bus 
##, the phoneme-threephone sequence recognition response is formed by the following 
extracting algorithm. Let $n_1^* = l$, $\sigma_1^* = \#\#$, and for $r = 1, 2, 3, \ldots$ extract 
$n_{r+1}^* = n_{n_r^*}(\sigma_r^*)$, $\sigma_{r+1}^* = \beta_{n_r^*}(\sigma_r^*)$ until $n_{r+1}^* = 0$ is reached. Then 
the phoneme sequence $V_r^*$, $r = 1, 2, 3, \ldots$, will be the phoneme recognition response in the 
opposite direction and $n_r^*$, $r = 1, 2, 3, \ldots$, will be the respective phoneme bounds in the signal 
$X_{0l}$. 
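
The extracting algorithm is a backward trace through the stored tables. The sketch below illustrates one possible reading and is not the author's code: the tables are assumed to be dictionaries keyed by (time, bus), a bus is a pair of phoneme names, and the phoneme reported for each step is taken to be the central phoneme of the PT ending on the bus, i.e. the first symbol of the bus name. That last choice is an assumption of this sketch.

```python
def extract_response(n_tab, beta_tab, l, pause_bus=("#", "#")):
    """Trace back from time l on bus ## and return (phonemes, bounds) in forward order.

    n_tab[(i, bus)]   : start time of the best PT ending on `bus` at time i
    beta_tab[(i, bus)]: input bus of that PT, i.e. the previous output bus
    """
    i, bus = l, pause_bus
    phonemes, bounds = [], []
    while i > 0:
        phonemes.append(bus[0])   # assumed: central phoneme of the PT ending here
        bounds.append(i)
        i, bus = n_tab[(i, bus)], beta_tab[(i, bus)]
    phonemes.reverse()            # the trace yields the response in the opposite direction
    bounds.reverse()
    return phonemes, bounds
```

With hypothetical tables describing a pause, the phoneme A and a final pause ending at times 3, 8 and 10, the function returns `(['#', 'A', '#'], [3, 8, 10])`.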

To begin the recognition process, $F_0(\sigma = \#\#) = 1$ is assigned to the input bus ## and 
$F_0(\sigma) = 0$ to all other input and output buses. 

5 The Generalised Algorithm 

To find the $N \ge 1$ best phoneme or PT sequences in the signal $X_{0l}$, let us modify the basic 
algorithm. 

Now for all output buses $\sigma = wV$, $w, V \in K$, in the CPG and for any time step $i$, 
an $N$-tuple not of triples but of 4-tuples $(F_i^r(\sigma), n_i^r(\sigma), \beta_i^r(\sigma), z_i^r(\sigma))$, $r = 1 : N$, 
is calculated. It is composed of the $N$ best probabilities $F_i^r(\sigma)$ that correspond to the $N$ best 
but different phoneme sequence recognition responses for $X_{0i}$. 

In accordance with the basic formulae for the PT internal states, it is possible 
to choose the $N$ best different responses by considering $2N$ probabilities $F$. 
For output or input buses, however, there are more than $2N$ similar probabilities, 
exactly $|K|N$. To ensure the linkage for the extracting algorithm we need to introduce the 
value $z_i^r(\sigma)$. It indicates the $z_i^r(\sigma)$-th place in the previous $N$-tuple to which the considered 
4-tuple $(F_i^r(\sigma), n_i^r(\sigma), \beta_i^r(\sigma), z_i^r(\sigma))$ refers. 

Now, to form the generalised PT or phoneme sequence recognition response, it 
is necessary to store in memory the $N$-tuple of 4-tuples $(F_i^r(\sigma), n_i^r(\sigma), \beta_i^r(\sigma), z_i^r(\sigma))$, 
$i = 1 : l$, for all $\sigma = wV$, $w, V \in K$, and then to use a slightly more complicated extracting 
algorithm. 
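
The selection of the N best candidates on one output bus can be sketched as follows. This is a simplified illustration, not the algorithm of the paper: it only ranks the candidate 4-tuples by probability and does not check that the corresponding phoneme sequences differ. The function name and the tuple layout `(F, n, beta, z)` are assumptions.

```python
import heapq

def n_best_on_bus(candidates, N):
    """Return the N most probable 4-tuples (F, n, beta, z), best first.

    candidates: 4-tuples collected from all PTs ending on one output bus,
    where z is the rank of the hypothesis in the previous N-tuple that is extended.
    """
    return heapq.nlargest(N, candidates, key=lambda c: c[0])

# Hypothetical candidates from two PTs ending on the same bus.
cands = [
    (0.10, 3, ("A", "B"), 1),
    (0.40, 2, ("C", "B"), 1),
    (0.30, 2, ("C", "B"), 2),
    (0.05, 3, ("A", "B"), 2),
]
print(n_best_on_bus(cands, 2))
```

The stored rank z is what lets the extracting algorithm follow the correct hypothesis when it steps back to the previous bus.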

6 Conclusion 

There is an opinion that it is possible to design a machine for automatic phoneme 
recognition in continuous speech, like a phonetic stenographer, without any appeal to 
speech understanding. Here an effective robust algorithm for solving this problem is 
proposed, which guarantees finding the $N \ge 1$ best phoneme sequence responses. 

A similar effect is expected in automatic speech understanding when 
prosodic information is taken into account. 


Abstract. The paper deals with the problem of choosing and optimizing the 
inventories of speech segments (especially with respect to concatenative speech 
synthesis). We offer a taxonomy of the segment databases based on elementary 
properties of the segments in relation to a basic speech corpus. Next, we deal with 
a general abstract formulation of the problem and briefly discuss its algorithmic 
solution and applications. 



1 Introduction 

Present concatenative speech synthesis systems use a variety of segments: allophones, 
diphones, triphones, half-syllables, demi-syllables, syllabic segments and some other 
types. Thanks to the increasing technical parameters of computers, the number of segments is 
nowadays not so critical, and we can see attempts to exploit this fact to increase the 
quality of synthesized speech by using larger segments that involve more coarticulation 
(see [1,7,2,3,10,17]). 

A natural question arises if we look for an appropriate set of segments for a given 
model: is the chosen set of segments (in some sense) optimal? This question is especially 
natural when using heterogeneous segments. In what follows, we will offer a general 
abstract formulation of this problem, we will discuss possible methods that would allow 
us to obtain the optimal set of segments and we will mention a concrete application as 
well. 

In the text we use standard terms and notation of the theory of formal languages and 
automata. If M is an alphabet (i.e. a finite nonempty set), then M* will denote the free 
monoid over the set M, i.e. the set of all strings consisting of the elements of the set M 
(including the empty string). card(M) denotes the cardinality of M, i.e. (for finite sets)
the number of elements belonging to the set M. 

2 Taxonomy Of Segment Databases 

First, we would like to systemize basic possible sets of segments. Our approach is based 
on algebraic abstraction that enables us to see the problem generally. 

Let A be an alphabet. In our interpretation, the alphabet is a basic set of the speech 
segments. For example, it can be the set of allophones. Another possibility is to choose 
diphones, demisyllables, etc. The basic set of segments in our model is used to describe 
derived, more complex segment databases. 



V. Matousek et al. (Eds.): TSD 2001, LNAI 2166, pp. 208-213, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 




Algebraic Models of Speech Segment Databases 209 



Now, let S be a finite nonempty subset of A*. The set S has the meaning of a segment
database. In our abstraction we assume that all elements of S consist of the basic
elements, i.e. that all elements of S are strings over the alphabet A. A case of
particular interest to us is databases of “large” segments, e.g. syllabic segments
combined with morphemes (see Section 4). Formally, there is no difference between
the abstract “segment database” and the abstract “corpus”. Theoretically, all elements of the
speech corpus can be taken as segments. In practice this is usually impossible due to the great
number of corpus elements.

Finally, let C be a finite nonempty subset of A* . C will be interpreted as a speech 
corpus. 

As an illustration, we can consider A to be the set of phonemes, S to be the set of 
syllable segments and C to be a speech corpus. 

Now, we will introduce the following classification of the basic segment sets: 

2.1 Compatibility 

We shall say that a set of segments $S = \{s_1, s_2, \ldots, s_n\}$ is compatible with the corpus
$C$ if for any $u \in C$ there are $s_i, s_j, \ldots, s_k \in S$ such that $u = s_i s_j \ldots s_k$.

Let us mention that in this model the alphabet need not be identical with the set of phonemes
(we can use allophones as well, or we can define an elementary alphabet so that it suits
our approach). Clearly, the alphabet and the corpus C can be viewed as sets of segments
compatible with the corpus C (when we identify the elements of the alphabet with the
corresponding strings). Let us denote by $\mathcal{S}(C)$ the set of all sets of segments compatible
with C.
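
The compatibility condition can be checked mechanically. The following is a minimal sketch, assuming that the elements of the alphabet A are single characters, so that segments and corpus items are ordinary Python strings; the function name `is_compatible` is illustrative and does not come from the paper.

```python
def is_compatible(segments, corpus):
    """Return True if every corpus string is a concatenation of segments."""
    segs = [s for s in set(segments) if s]  # the empty string is never useful

    def decomposable(u):
        # reachable[i] is True when the prefix u[:i] can be written
        # as a concatenation of segments.
        reachable = [True] + [False] * len(u)
        for i in range(len(u)):
            if reachable[i]:
                for s in segs:
                    if u.startswith(s, i):
                        reachable[i + len(s)] = True
        return reachable[len(u)]

    return all(decomposable(u) for u in corpus)
```

For the corpus C = {xy, y}, the set {x, y} is compatible, whereas {xy} alone is not, because y cannot be composed from it.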

2.2 Consistency 

We shall say that a compatible set S is consistent if each element of S is a substring of
a string belonging to C. Consistency is a very natural assumption: in most
applications the segments that violate this condition are superfluous and can be eliminated.

2.3 Segment Bases 

We shall say that a compatible set S is a base (of C) if removing any element of S yields
a set that is not compatible. Thus, having in mind just compatibility, the
bases are in this sense optimal.
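
Consistency and the base property follow directly from the definitions and can be tested by brute force. The sketch below again assumes single-character alphabet symbols and Python strings; all function names are illustrative.

```python
def is_compatible(segments, corpus):
    """True if every corpus string is a concatenation of segments."""
    segs = [s for s in set(segments) if s]

    def decomposable(u):
        reachable = [True] + [False] * len(u)
        for i in range(len(u)):
            if reachable[i]:
                for s in segs:
                    if u.startswith(s, i):
                        reachable[i + len(s)] = True
        return reachable[len(u)]

    return all(decomposable(u) for u in corpus)


def is_consistent(segments, corpus):
    """Compatible, and every segment occurs as a substring of some corpus string."""
    return is_compatible(segments, corpus) and all(
        any(s in u for u in corpus) for s in segments
    )


def is_base(segments, corpus):
    """Compatible, and removing any single segment destroys compatibility."""
    segs = set(segments)
    if not is_compatible(segs, corpus):
        return False
    return all(not is_compatible(segs - {s}, corpus) for s in segs)
```

With C = {xy, y}, both {x, y} and {xy, y} are bases, while {x, y, xy} is compatible but not a base, since xy can be removed.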

2.4 Minimality 

We shall say that a compatible set S is minimal if for any compatible set S'

$\mathrm{card}(S) \leq \mathrm{card}(S')$.

The minimality condition is a natural demand in many applications, because it can
reduce the work spent on building the database. Provided C is sufficiently large and
representative, the alphabet A is expected to be the only minimal set. It is easy to see that this is not
true in general. There exists a polynomial algorithm for determining minimal sets [13].

2.5 Strong Minimality

We shall say that a compatible set S is strongly minimal if it is minimal and for any
minimal set S'

$\sum_{s \in S} \mathrm{lgth}(s) \leq \sum_{s \in S'} \mathrm{lgth}(s)$

(lgth stands for the length of the corresponding string).

Strong minimality does not follow from minimality, as can be seen e.g. from the
following example:

C = {xy, y}, S = {x, y}.

Both C and S are minimal sets for C (no single segment generates both xy and y), but only
S is strongly minimal: its total segment length is 2, whereas that of C is 3.
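
The example can be verified by exhaustive search. The sketch below is a brute-force illustration only (exponential in the number of candidate segments), not the polynomial algorithm of [13]. It assumes single-character alphabet symbols and nonempty corpus strings, and it uses the fact that a minimal set is a base and hence consistent (Propositions 1 and 5 below), so that only substrings of corpus strings need to be tried. All names are illustrative.

```python
from itertools import combinations


def is_compatible(segments, corpus):
    """True if every corpus string is a concatenation of segments."""
    segs = [s for s in set(segments) if s]

    def decomposable(u):
        reachable = [True] + [False] * len(u)
        for i in range(len(u)):
            if reachable[i]:
                for s in segs:
                    if u.startswith(s, i):
                        reachable[i + len(s)] = True
        return reachable[len(u)]

    return all(decomposable(u) for u in corpus)


def substrings(corpus):
    """All nonempty substrings of the corpus strings."""
    return {u[i:j] for u in corpus
            for i in range(len(u)) for j in range(i + 1, len(u) + 1)}


def minimal_sets(corpus):
    """All compatible sets of the smallest possible cardinality."""
    pool = sorted(substrings(corpus))
    for k in range(1, len(pool) + 1):
        found = [set(c) for c in combinations(pool, k)
                 if is_compatible(c, corpus)]
        if found:
            return found
    return []


def strongly_minimal_sets(corpus):
    """Minimal sets with the smallest total segment length."""
    candidates = minimal_sets(corpus)
    best = min(sum(len(s) for s in m) for m in candidates)
    return [m for m in candidates if sum(len(s) for s in m) == best]
```

For C = {xy, y} the search finds exactly two minimal sets, {x, y} and {xy, y}, of which only {x, y} is strongly minimal.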

2.6 Homogeneity

S is homogeneous if each element of C can be obtained uniquely as a concatenation of
segments belonging to the set S. This condition is fulfilled for many practical instances
of segment databases, like allophones, diphones, syllable segments, etc.
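
Uniqueness of the decomposition can be tested by counting the segmentations of each corpus string. The following sketch assumes single-character alphabet symbols; the function names are illustrative.

```python
def count_segmentations(u, segments):
    """Number of distinct ways to write u as a concatenation of segments."""
    segs = [s for s in set(segments) if s]
    # ways[i] counts the segmentations of the prefix u[:i].
    ways = [1] + [0] * len(u)
    for i in range(len(u)):
        if ways[i]:
            for s in segs:
                if u.startswith(s, i):
                    ways[i + len(s)] += ways[i]
    return ways[len(u)]


def is_homogeneous(segments, corpus):
    """True if every corpus string has exactly one segmentation."""
    return all(count_segmentations(u, segments) == 1 for u in corpus)
```

For C = {xy, y}, the set {x, y} is homogeneous, while {x, y, xy} is not, because xy can be obtained both as x followed by y and as the single segment xy.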

2.7 Strong Homogeneity

A compatible set S is strongly homogeneous if no element of S is a substring of a different
element of S. This condition is stronger than homogeneity (see Proposition 2).

2.8 Strict Homogeneity

A compatible set S is strictly homogeneous if for any $u \in S$ there do not exist $v \in S$ and
$w \in S$ such that $v \neq u$, $w \neq u$ and $u$ is a substring of $vw$. This condition is stronger
than strong homogeneity (see Proposition 3).
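
The two stronger conditions depend only on the segment set itself and can be checked pairwise. The sketch below takes compatibility for granted and tests only the substring conditions; it assumes Python strings as segments, and the function names are illustrative.

```python
def is_strongly_homogeneous(segments):
    """No segment is a substring of a different segment."""
    segs = set(segments)
    return not any(s != t and s in t for s in segs for t in segs)


def is_strictly_homogeneous(segments):
    """No segment u is a substring of vw for segments v, w other than u."""
    segs = set(segments)
    for u in segs:
        others = segs - {u}
        # v and w may coincide, as the definition only requires v != u, w != u.
        if any(u in v + w for v in others for w in others):
            return False
    return True
```

For instance, {ab, ca, bd} is strongly homogeneous but not strictly homogeneous, because ab occurs across the boundary of ca followed by bd.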

2.9 Heterogeneity

A compatible set S is heterogeneous if it is not homogeneous. Typically, heterogeneous 
segments can be used in larger databases that are developed in order to achieve natural 
sounding speech by means of involving more coarticulation effect inside the segments. 

2.10 Elementary Properties of the Basic Types of Segment Sets 

Directly from the definition we get the following properties of the basic classes of 
segments: 

Proposition 1. Any base is a consistent set.

Clearly, the elements (of a compatible set) that violate the consistency condition can
be removed leaving the set compatible.

Proposition 2. If S is strongly homogeneous, then it is homogeneous.

If $u = u_1 u_2 = v_1 v_2$ and $u_1 \neq v_1$, then either $u_1$ is a substring of $v_1$ or $v_1$ is
a substring of $u_1$, i.e. if the set is not homogeneous then it is not strongly homogeneous
either, which is equivalent to the assertion.

Proposition 3. If S is strictly homogeneous, then it is strongly homogeneous.

Proposition 4. Any consistent strictly homogeneous set is a base.

(However, a consistent strongly homogeneous set need not be a base.)

Proposition 5. Any minimal set is a base.

(However, a base need not be a minimal set.)

Besides this “set theoretical” taxonomy we can, of course, distinguish the segments
also from other points of view. For example, based on the related linguistic disciplines we
can distinguish phonetic segments (allophones, diphones, etc.), phonological segments
(syllables) [5,8,9,10,11,12] and morphological segments (segments derived within
morphology) [14,15,16].

3 Abstract Formulation of the Optimization Problem 

Let $S \in \mathcal{S}(C)$ and $u \in C$. We denote by $[u/S]$ the minimal number of segment
boundaries in $u$ (over all possible concatenations of $u$ by means of the segments belonging
to $S$). Further, we denote $[C/S] = \sum_{u \in C} [u/S]$. Now, we formulate the objective
function $g(S)$ as follows:

$g(S) = \alpha \, \mathrm{card}(S) + \beta \, [C/S]$.

This definition has the following motivation. When defining the set of segments, we try
to choose them in such a way that their number (card(S)) is as small as possible, and
simultaneously we try to involve maximum coarticulation in the segments, which means
that we would also like to minimize the number of boundaries in the concatenated
speech ([C/S]). (The coarticulation compatibility on the boundaries of the segments
has to be ensured by the choice of the alphabet A.) The (non-negative) coefficients $\alpha$ and
$\beta$ are weights that we assign to the specific demands on the optimality criteria. Now, our
optimization problem reads as follows:

Problem 1: Find a set S of segments that minimizes the value g(S).
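
To make the objective concrete, the following sketch evaluates g(S) for a given segment set. It rests on two assumptions that are not fixed by the text: a string cut into k segments is counted as having k - 1 boundaries, and [C/S] is the plain sum of these counts over the corpus. Alphabet symbols are assumed to be single characters, and the function names are illustrative.

```python
def min_boundaries(u, segments):
    """Minimal number of internal segment boundaries in u (assumed k - 1 for k segments)."""
    inf = float("inf")
    segs = [s for s in set(segments) if s]
    # fewest[i] is the smallest number of segments that spell the prefix u[:i].
    fewest = [0] + [inf] * len(u)
    for i in range(len(u)):
        if fewest[i] < inf:
            for s in segs:
                if u.startswith(s, i):
                    j = i + len(s)
                    fewest[j] = min(fewest[j], fewest[i] + 1)
    if fewest[len(u)] == inf:
        raise ValueError(f"{u!r} cannot be composed from the given segments")
    return max(fewest[len(u)] - 1, 0)


def objective(segments, corpus, alpha=1, beta=1):
    """g(S) = alpha * card(S) + beta * [C/S]."""
    boundaries = sum(min_boundaries(u, segments) for u in corpus)
    return alpha * len(set(segments)) + beta * boundaries
```

With C = {xy, y} and unit weights, the set {x, y} gives g = 2 + 1 = 3, while {xy, y} gives g = 2 + 0 = 2.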

Of course, this problem can be solved by evaluating g(S) for all possible sets of
segments (compatible with C), but this algorithm is exponential and practically not
usable. Hence, we have

Problem 2: Is there a polynomial algorithm solving Problem 1?

If we choose $\beta = 0$, we obtain the problem of finding a minimal set, which is
polynomially solvable (see Section 2.4). Unfortunately, the author does not know a polynomial
algorithm for the case $\beta \neq 0$. In this situation, an approximate algorithm would also be
useful. Hence, we can formulate

Problem 3: Find an approximate polynomial algorithm solving Problem 1.

As long as Problem 2 and Problem 3 remain open, we cannot solve Problem 1
efficiently. Instead, we may replace Problem 1 by the following

Problem 4: Find a set S of segments that locally minimizes the value g(S).

By “locally minimizing” we mean finding a set of segments S such that both
adding and removing an arbitrary segment increases the value g(S). Clearly, if S minimizes
g(S), it locally minimizes g(S) as well. This problem is algorithmically solvable in
polynomial time (provided that the length of the elements of C is bounded by a number
that does not depend on the instance C). A possible algorithm is as follows:

1. Take a compatible set of segments S.

2. Try to add a segment that decreases g{S). 

3. In positive case, repeat 2. 

4. If such a segment does not exist, try to remove a segment such that g{S) will decrease. 

5. In positive case, repeat 4. 

6. If such a segment does not exist, S locally minimizes g{S) and the computation 
stops, else go to 2. 

Because the value of the function g(S) is an integer (for integer weights) and decreases
with every change, the computation stops after fewer steps than the initial value of g(S).
The algorithm can be modified by specifying the order of adding and removing segments in
order to reach as small a value of g(S) as possible or in order to speed up the process. This
issue should also be investigated more deeply, and some estimate of the effectiveness of the
algorithm should be found.
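
The local search can be sketched as follows. This is one possible reading of the steps above, under several assumptions that the text leaves open: the initial compatible set is the alphabet (the single characters occurring in C), candidate segments are all substrings of corpus strings, a string cut into k segments has k - 1 boundaries, and a change is accepted only if it strictly decreases g(S). All names are illustrative.

```python
def substrings(corpus):
    """All nonempty substrings of the corpus strings (the candidate pool)."""
    return {u[i:j] for u in corpus
            for i in range(len(u)) for j in range(i + 1, len(u) + 1)}


def min_segments(u, segments):
    """Smallest number of segments spelling u, or infinity if impossible."""
    inf = float("inf")
    fewest = [0] + [inf] * len(u)
    for i in range(len(u)):
        if fewest[i] < inf:
            for s in segments:
                if s and u.startswith(s, i):
                    j = i + len(s)
                    fewest[j] = min(fewest[j], fewest[i] + 1)
    return fewest[len(u)]


def objective(segments, corpus, alpha, beta):
    """g(S); infinite when S is not compatible with the corpus."""
    counts = [min_segments(u, segments) for u in corpus]
    if any(c == float("inf") for c in counts):
        return float("inf")
    return alpha * len(segments) + beta * sum(max(c - 1, 0) for c in counts)


def local_minimum(corpus, alpha=1, beta=1):
    pool = substrings(corpus)
    current = {ch for u in corpus for ch in u}        # step 1: the alphabet
    best = objective(current, corpus, alpha, beta)
    improved = True
    while improved:                                   # step 6: repeat until stable
        improved = False
        for s in sorted(pool - current):              # steps 2-3: try additions
            value = objective(current | {s}, corpus, alpha, beta)
            if value < best:
                current, best, improved = current | {s}, value, True
        for s in sorted(current):                     # steps 4-5: try removals
            value = objective(current - {s}, corpus, alpha, beta)
            if value < best:
                current, best, improved = current - {s}, value, True
    return current, best
```

For C = {xy, y} with unit weights the search stays at {x, y} with g = 3, although {xy, y} attains g = 2. This illustrates that a local minimum need not be global. With a larger boundary weight ($\beta = 3$) the same search reaches {xy, y}.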

4 Applications 

The above-mentioned problems may arise in relation to creating and optimizing a
heterogeneous segment database. We are applying the approach in building the Czech speech
synthesizer Demosthenes [10,12]. The synthesizer was originally based on syllable segments
[5,8,11,12], but in order to enhance the quality of the synthesized speech we have
decided to use also frequent bi-syllables (chosen by statistics) and morphological segments.

The experimental use of morphological segments is motivated by the fact that in
a highly inflected language like Czech a suitable combination of morphemes and
syllables creates an effective database of speech segments. We must, however, use some
optimization criteria that will eliminate the segments whose use would be ineffective.
Presently, we are preparing a corpus suitable for the application of the proposed approach.

5 Conclusions 

In the paper we have addressed the problem of optimizing the set of segments for
concatenative speech synthesis and formulated some open problems. Besides solving the
open problems and testing the method in concrete applications, we would like in
future work also to compare this approach with approaches that may come from statistics
and information theory, especially to study the relations to conditional entropy and cross
entropy of the system.

Abstract. Our work presents a novel data-driven compensation technique that
modifies on-line the incoming spectral representation of degraded speech to
approximate the features of high quality speech used to train a classifier. We apply
the Bayesian inference framework to the degraded spectral coefficients, based on
modeling the clean speech linear spectrum with appropriate non-Gaussian
distributions that allow a maximum a-posteriori (MAP) closed-form solution to be obtained. The MAP
solution leads to a soft threshold function applied and adapted to the spectral
characteristics and noise variance of each spectral band. We perform an extensive
evaluation of our algorithm against white and coloured Gaussian noise in the context of
Automatic Speech Recognition (ASR), and demonstrate its robustness in adverse
conditions. The enhancement process comes at little to no extra computational
overhead, thus achieving real-time, on-line performance.



1 Introduction 

Although ASR has come to a point that enables the launch of commercial products,
the operational function of ASR machines is restricted by the influence of the
acoustical environment. High quality speech used to train recognizers is deprived of various
sources of variability that are common in operational use. This inconsistency is reflected
in the degrading recognition score as we move from laboratory conditions towards
real-world applications. ASR methods make implicit assumptions as to variability induced
by noise sources, by building speech sound models based on large speech corpora
gathered in real conditions, thus including in their construction common sources of
variability. It is practically impossible to incorporate sufficient speech training data in all
different contexts, and to cope with unseen or poorly represented sources of
degradation. One possible solution is to modify the training data by artificially adding noise
segments from the operational environment and re-train the classifier. However, the
philosophy that currently predominates is to compensate for the effect of noise to approximate
the matched case of training and operational conditions. A comprehensive assessment of
noise compensation methods that belong to three broad processing strategies can be
found in [1]. They involve a transformation of the noisy waveform or feature vectors to
speech/features on which the model of the recognizer has been trained, or corrections to
the mean vectors and covariance matrices of the distributions of the clean HMM models
to match the distribution of incoming degraded speech. In this work we try to reduce the



V. Matousek et al. (Eds.): TSD 2001, LNAI 2166, pp. 214-221, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 




Bayesian Noise Compensation of Time Trajectories 215 



mismatch between training and testing conditions by applying a nonlinear transforma- 
tion to the feature-space of time trajectories of spectral coefficients. In what follows we 
foreground the main presuppositions of successful techniques that apply transformations 
to the feature-space or to HMM models and from which our approach departs. 

The distribution of log-spectral and cepstral coefficients is highly non-Gaussian and 
often multi-modal and asymmetric both for clean and degraded speech. HMM model 
based compensation of additive noise implicitly assumes that the corrupted speech dis- 
tributions in log-spectral or cepstral domain are still Gaussian (see [2] and references 
therein). When a mixture of Gaussians is adopted to improve the approximation of the
degraded distributions, time and memory costs impose serious constraints [2][3][4]. In
[3] a Minimum Mean Square Error (MMSE) criterion is used to infer the correction factors (mean
vectors and covariance matrices) applied to the mixture of Gaussians of the degraded speech
in order to match the mixture pdf of clean speech. Unfortunately, MMSE does not
permit closed-form manipulation in this case unless a series of simplifying assumptions is
adopted. In [4] the corrupted cepstrum is assumed to follow a Gaussian mixture as in
the clean spectrum case, where the non-linearity imposed on the clean spectrum pdf is
approximated by a Taylor series. An extension of this work to model space is presented in
[5]. Last but not least, the efficient MAP [6] and MLLR [7] require rather long adap- 
tation data from the noisy environment in order to obtain good estimates to modify the 
clean speech model set. We attempt to address the problem of noise compensation by 
casting it as a problem in Bayesian inference in the linear-spectral domain. We lay out 
its derivation in two steps: 

a) We employ some well-known as well as some new probability distributions to ac- 
count for the non-Gaussian representation of each band of a large ensemble of high 
quality clean speech. The pdf that best represents each band is inferred in the process 
of fitting an approximating distribution to the non-parametrically derived pdf (his- 
togram method). The accuracy of the fit is assessed by the Kullback-Leibler (KL) 
divergence measure between each candidate distribution and the non-parametric 
estimation. 

b) Based on the assumption that background noise is additive and Gaussian (although 
it can be coloured), we use the same unified mathematical framework to derive 
the closed-form MAP estimation of the underlying spectral coefficient. Each band 
possesses its unique shrinkage function depending on the pdf that was selected in 
the previous step and on the noise variance of each band. 

We support theoretical derivations by extensive experimentation using recorded 
speech signals and real noise sources from the NOISEX-92 database. Assessment criteria 
are based on word recognition accuracy of a speech recognition system and subjective 
measurements. Subjective tests include visual comparison of speech spectrograms and 
informal listening tests. 

2 Problem Formulation 

Based on the assumption of a linear, time-invariant channel independent of the signal
level and of uncorrelated noise, we can derive a linear-spectral representation using a
2N-point FFT:

$X(\omega_\kappa) = H S(\omega_\kappa) + N(\omega_\kappa), \quad \kappa = 0, \ldots, N$    (1)

where $X(\omega_\kappa)$ denotes the amplitude of the spectrum of the degraded speech in sub-band $\kappa$,
$H$ the constant-over-time channel effect and $N(\omega_\kappa)$ the noise amplitude.

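The posterior pdf of $HS(\omega_\kappa)$ can be expressed according to the Bayes rule:

$f(HS(\omega_\kappa) \mid X(\omega_\kappa)) = \frac{f(X(\omega_\kappa) \mid HS(\omega_\kappa)) \, f(HS(\omega_\kappa))}{f(X(\omega_\kappa))}$    (2)

The pdf $f(X(\omega_\kappa) \mid HS(\omega_\kappa)) \propto N(X(\omega_\kappa) - HS(\omega_\kappa) - m_n, \sigma_n)$, when substituted
into Eq. 2, yields the general expression for the a-posteriori estimate of the underlying
clean spectral magnitude of each individual spectral band:

$HS(\omega_\kappa)_{map} = \arg\max_{H|S(\omega_\kappa)|} \left[ \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left( -\frac{(X(\omega_\kappa) - HS(\omega_\kappa) - m_n)^2}{2\sigma_n^2} \right) f(HS(\omega_\kappa)) \right]$    (3)

We proceed with the estimation of the appropriate spectral pdf of clean speech, that is,
identification of the marginal distribution and its descriptive parameters for each spectral
band.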



2.1 Density Estimation of Clean Speech Spectral Bands 

We examine the statistical behaviour of each spectral band over a large ensemble of 
clean recordings to derive the underlying marginal pdf, which is needed in the Bayesian 
formulation of the denoising process. 

The appropriate density to parameterize the distribution of the amplitude of each
spectral band $\kappa = 0, \ldots, N$ in Eq. 1 is selected according to the smallest
Kullback-Leibler divergence between the non-parametric density estimate of each band (histogram
method) and a suitable, fitted parametric density (Eq. 4). The KL divergence
between the non-parametric density $f_o$ and the selected representative pdf $f_s$ is
always non-negative, and zero for a complete match.

$KL(f_o, f_s) = \int f_o \log \frac{f_o}{f_s} \, ds$    (4)

In order to account for the large variety of probability densities that characterize
different bands of the ensemble, a family of flexible distributions is selected that allows for
a closed-form MAP estimate to be derived under the assumption of additive Gaussian
noise $n \sim N(m_n, \sigma_n)$ in each spectral band. The free parameters of each candidate
density are estimated according to the Maximum Likelihood Estimation (MLE) criterion, which gives
the highest likelihood given the set of spectral observations of each band; therefore, the
family of curves that are tested actually comprises a cluster of variations. After finding
the parameters that best describe the observational data for each pdf, all densities
are tested against the non-parametric pdf $f_o$. The one that returns the lowest error
according to the KL divergence is selected to represent the pdf of the spectral band.
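
The selection procedure (maximum likelihood fit of each candidate, followed by a KL comparison against the histogram estimate) can be sketched as below. This is a simplified illustration under stated assumptions: only two candidate families are included (exponential, which the text mentions as a special case of the Gamma pdf, and Gaussian), the integral of Eq. 4 is approximated by a sum over histogram bins, and the samples are assumed not to be all equal. The function names and the bin count are hypothetical.

```python
import math


def histogram_density(samples, bins=40):
    """Non-parametric density estimate f_o on equally spaced bins."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    density = [c / (len(samples) * width) for c in counts]
    return centers, density, width


def kl_divergence(centers, density, width, pdf):
    """Discrete approximation of KL(f_o, f_s) = integral of f_o log(f_o / f_s)."""
    total = 0.0
    for s, f_o in zip(centers, density):
        if f_o > 0.0:
            f_s = pdf(s)
            if f_s <= 0.0:
                return float("inf")
            total += f_o * math.log(f_o / f_s) * width
    return total


def fit_exponential(samples):
    """MLE fit of an exponential pdf: the scale is the sample mean."""
    scale = sum(samples) / len(samples)
    return lambda s: math.exp(-s / scale) / scale if s >= 0 else 0.0


def fit_gaussian(samples):
    """MLE fit of a Gaussian pdf: sample mean and (biased) standard deviation."""
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
    norm = std * math.sqrt(2.0 * math.pi)
    return lambda s: math.exp(-0.5 * ((s - mean) / std) ** 2) / norm


def select_density(samples, bins=40):
    """Return the name of the candidate with the lowest KL divergence, and all scores."""
    centers, density, width = histogram_density(samples, bins)
    candidates = {"exponential": fit_exponential(samples),
                  "gaussian": fit_gaussian(samples)}
    scores = {name: kl_divergence(centers, density, width, pdf)
              for name, pdf in candidates.items()}
    return min(scores, key=scores.get), scores
```

In the setting of the paper, `samples` would be the amplitudes of one spectral band collected over the clean-speech ensemble, and the candidate list would contain the four families of Eqs. 5-8.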

The Gamma pdf (Eq. 5) is a family of curves determined by two parameters that can take
a versatile shape, effectively representing highly kurtotic non-symmetric distributions.
It is essential to note that the chi-square and exponential distributions can be derived from
the Gamma distribution by fixing one of the two free parameters; therefore MLE accounts
for these cases too.

$f(HS(\omega_\kappa) \mid a, b) = \frac{1}{b^a \Gamma(a)} (HS(\omega_\kappa))^{a-1} \exp(-HS(\omega_\kappa)/b)$    (5)

The Gaussian-Laplacian pdf (Eq. 6) is a combination of the Gaussian and Laplacian
densities, suitable for representing moderately sparse distributions (sparser than a Gaussian
and less sparse than a Laplacian for the same variance).

$f(HS(\omega_\kappa) \mid a, b) = C \exp\left(-\frac{a}{2} (HS(\omega_\kappa))^2 - b\,HS(\omega_\kappa)\right)$    (6)

The symmetric form of Hyvärinen's distribution was originally presented in [8].
Hyvärinen's pdf is very sparse (sparser than any other pdf) and proves very effective in
capturing the probability specifications of the spectral distributions of most bands.

$f(HS(\omega_\kappa) \mid a, b) = C \left(\sqrt{a(a+1)/2} + HS(\omega_\kappa)/d\right)^{-(a+3)}$    (7)

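The Gaussian pdf (Eq. 8) is suitable for modelling spectral bands that are slightly
sub-Gaussian. Spectral bands with super-Gaussian pdfs are captured by the aforementioned
distributions.

$f(HS(\omega_\kappa) \mid \mu_s, \sigma_s) = \frac{1}{\sqrt{2\pi}\,\sigma_s} \exp\left(-\frac{(HS(\omega_\kappa) - \mu_s)^2}{2\sigma_s^2}\right)$    (8)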

In our work we adapted this distribution to enforce positivity in order to be consistent
with the non-negative characteristic of the spectrogram. Spectral band probability model
fitting, which is the most time-consuming task of the algorithm, is computed off-line and
only once for all subsequent restorations.



2.2 Noise Variance Estimation 

Though background noise can have a spectral density that is specific to the operational
environmental conditions, in many cases it can be adequately simulated as additive
Gaussian. The symbol $\sigma_n$ in Eq. 3 denotes an estimate of the standard deviation
of the noise. In the off-line denoising mode of the algorithm, the noise can be estimated from
the standard deviation of the sparsest spectral band (a method also used in wavelet
denoising). The sparsity of each band is assessed by its kurtosis value. To be more
specific, the kurtosis of the pdf of each band is derived and multiplied by its standard
deviation. To the band with the lowest value we fit a Gaussian. The mean value and standard
deviation derived by Maximum Likelihood Estimation are used as the mean value $m_n$ and
standard deviation $\sigma_n$ of the noise pdf in every band. In order to address the
frequency-dependent SNR of coloured noise in an on-line fashion, an averaged noise spectrum is estimated
with a first-order digital low-pass filter $|\bar{N}_{k,i}(\omega_\kappa)| = \rho\,|\bar{N}_{k,i-1}(\omega_\kappa)| + (1 - \rho)\,|N_{k,i}(\omega_\kappa)|$,
where $k = 0, \ldots, N$ denotes the band, $i$ the current sample and $0.8 \leq \rho \leq 0.95$.
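
The on-line averaging is a per-band exponential smoothing of the noise magnitude. A minimal sketch follows; the initialization of the average with the first frame is an assumption (the text does not specify it), and the function name is illustrative.

```python
def smooth_noise_spectrum(noise_frames, rho=0.9):
    """Recursive average |N_bar_i| = rho * |N_bar_(i-1)| + (1 - rho) * |N_i| per band.

    noise_frames: list of frames, each a list of per-band noise amplitudes.
    Returns the running average after each frame.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    # Assumption: the average is initialized with the first observed frame.
    average = [abs(v) for v in noise_frames[0]]
    history = [average[:]]
    for frame in noise_frames[1:]:
        average = [rho * a + (1.0 - rho) * abs(v)
                   for a, v in zip(average, frame)]
        history.append(average[:])
    return history
```

Values of rho close to 1 give a slowly varying estimate, which matches the range 0.8 to 0.95 quoted above.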

2.3 Amplitude Estimation

Inserting the density models into Eq. 3, taking the derivative of the log-likelihood with
respect to $HS(\omega_\kappa)$ and setting it to zero gives the MAP estimate of the underlying
undistorted spectrum (see Appendix for details of the derivation).

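For the Gamma pdf the corresponding spectral estimator is:

$HS(\omega_\kappa)_{map} = \frac{bX(\omega_\kappa) - b m_n - \sigma_n^2 + \sqrt{(bX(\omega_\kappa) - b m_n - \sigma_n^2)^2 + 4 b^2 \sigma_n^2 (a - 1)}}{2b}$    (9)

For the Gaussian-Laplacian pdf the estimator is:

$HS(\omega_\kappa)_{map} = \frac{X(\omega_\kappa) - m_n - b\sigma_n^2}{1 + a\sigma_n^2}$    (10)

For Hyvärinen's pdf the estimator is:

$HS(\omega_\kappa)_{map} = \frac{X(\omega_\kappa) - bd - m_n}{2} + \frac{\sqrt{(X(\omega_\kappa) + bd - m_n)^2 - 4\sigma_n^2 (a + 3)}}{2}$    (11)

For the Gaussian pdf, with prior mean $\mu_s$ and standard deviation $\sigma_s$, the amplitude estimator is:

$HS(\omega_\kappa)_{map} = \frac{\sigma_s^2}{\sigma_n^2 + \sigma_s^2} (X(\omega_\kappa) - m_n) + \frac{\sigma_n^2}{\sigma_n^2 + \sigma_s^2} \mu_s$    (12)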

One can clearly see the subtractive, ‘shrinking’ effect on $X(\omega_\kappa)$, where
the magnitude of the shrinkage depends on the estimated noise variance and the expected
distribution of the underlying clean spectral band.
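
The shrinkage behaviour can be illustrated with three of the estimators. The sketch below assumes a Gaussian likelihood with noise mean `m_n` and standard deviation `sigma_n`, and priors of the forms Gamma(a, b) with scale b, exp(-a s^2 / 2 - b s), and Gaussian(mu_s, sigma_s). Each formula is the stationary point of the resulting log-posterior. Clamping negative results to zero is an added assumption, motivated by the non-negativity of spectral amplitudes. The function names are illustrative.

```python
import math


def map_gamma(x, a, b, m_n, sigma_n):
    """MAP amplitude under a Gamma(a, b) prior (b is the scale parameter)."""
    t = b * x - b * m_n - sigma_n ** 2
    disc = t * t + 4.0 * b * b * sigma_n ** 2 * (a - 1.0)
    if disc < 0.0:
        return 0.0  # no real stationary point: shrink to zero (assumption)
    return max((t + math.sqrt(disc)) / (2.0 * b), 0.0)


def map_gaussian_laplacian(x, a, b, m_n, sigma_n):
    """MAP amplitude under a prior proportional to exp(-a s^2 / 2 - b s)."""
    return max((x - m_n - b * sigma_n ** 2) / (1.0 + a * sigma_n ** 2), 0.0)


def map_gaussian(x, mu_s, sigma_s, m_n, sigma_n):
    """MAP amplitude under a Gaussian(mu_s, sigma_s) prior."""
    var_s, var_n = sigma_s ** 2, sigma_n ** 2
    return (var_s * (x - m_n) + var_n * mu_s) / (var_s + var_n)
```

With a = 0 the Gaussian-Laplacian estimator reduces to subtracting the constant `b * sigma_n**2` from the observation and clipping at zero, which is the familiar soft-threshold form. Small observations are suppressed entirely, larger ones are reduced by a fixed amount.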

There are several key points that provide a strong impetus to the Bayesian viewpoint.
The Bayesian formulation allows a structured approach towards regulating the trade-off
between distortion of the spectral balance of the processed speech signal and the noise
suppression factor. Moreover, the Bayesian framework makes explicit use of the underlying
probability distribution of the coefficient in each band, which is essential in cases where
the speech signal is completely masked by the overwhelming noise. Additionally, our
technique has the advantage of not requiring an adaptation stage to the new operational
environment.

Our algorithmic derivations are closely connected with [8] [9] and the wavelet shrinkage
method [10]. The approach resembles spectral subtraction in the spectral domain [11] but
differs from it in being fully parametric and in deriving its parameters automatically by
probability manipulation, thus avoiding the error-prone procedure of empirically tuning
the thresholds.



3 Experimental Results 

Our Bayesian framework makes the assumption that the distribution of noise is
Gaussian. Therefore, we expected some impact on its effectiveness for noise cases exhibiting
divergence from the normality assumption. In order to assess the robustness of our
technique for noise types that exhibit moderate departures from the normality assumption,
we conducted denoising experiments against artificially generated white Gaussian noise
and five real coloured types of noise taken from the NOISEX-92 database. As regards
the SNR of the recordings, each noise type is added to 34 clean speech files of 5 s
mean duration so that the SNR of the corrupted waveform ranges from -10 to 20 dB.

Word recognition accuracy was assessed using a speech recognition module built
with the HTK Hidden Markov Model toolkit. The basic recognition units are tied-state
context-dependent triphones of five states each. Given this set of HMMs and the
corresponding dictionary, the HTK recognition unit produces the best path through the word network
using the frame-synchronous Viterbi beam search algorithm. The testing set consists
of 100 files, part of the identity card corpus of the SpeechDat database (one speaker for
each recording). The baseline word recognition accuracy under the clean acoustic
environment is 94%. After applying the enhancement procedure, a bank of 20 Mel-spaced triangular
band-pass filters is applied to the spectrum. Thirteen-dimensional feature vectors
are formed by applying the DCT to the log filter-bank outputs, which reduces the 20 output
channels to 12-dimensional MFCC features plus a log-energy value. Cepstral mean
normalization was applied to deal with the constant channel assumption. Deltas and
double-deltas were concatenated to form the final 39-dimensional observation vector.
The results of the tests are given in Table 1 and demonstrate that our
enhancement method cooperates well with the HMM framework, achieving consistently
good results at very low SNRs. The best performance is obtained for the white Gaussian type
of noise. We attribute the extensive denoising capability to the fact that white Gaussian
noise complies with the assumptions of the algorithm more than any other type of noise.
Good performance is also observed for coloured types of noise, although they seem to
form a category of their own compared to the Gaussian noise.
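
The construction of the 39-dimensional observation vector can be sketched as follows. This is a simplified illustration and not the HTK implementation: the DCT is an unnormalized DCT-II, the deltas are plain symmetric differences with replicated edges rather than regression-window coefficients, and cepstral mean normalization is applied to all 13 static components including log energy. These choices are assumptions, and the function names are hypothetical.

```python
import math


def dct_cepstra(log_fbank, n_ceps=12):
    """Cepstral coefficients c_1..c_n_ceps from log filter-bank outputs (DCT-II)."""
    n = len(log_fbank)
    return [sum(e * math.cos(math.pi * k * (j + 0.5) / n)
                for j, e in enumerate(log_fbank))
            for k in range(1, n_ceps + 1)]


def mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames):
    """Symmetric first differences over time, with the edge frames replicated."""
    last = len(frames) - 1
    return [[(frames[min(t + 1, last)][d] - frames[max(t - 1, 0)][d]) / 2.0
             for d in range(len(frames[0]))]
            for t in range(len(frames))]


def observation_vectors(log_fbank_frames, log_energies):
    """12 cepstra + log energy, mean-normalized, plus deltas and double-deltas."""
    static = [dct_cepstra(fb) + [e]
              for fb, e in zip(log_fbank_frames, log_energies)]
    static = mean_normalize(static)
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]
```

Each input frame holds the 20 log filter-bank outputs computed from the enhanced spectrum, and each output vector has 13 + 13 + 13 = 39 components.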

Our approach is formulated in the linear-spectral domain and therefore, depending on
the application, can also be used for improving the quality of perception. In such a case we
ought to respect the specific idiosyncrasies of human hearing and, therefore,



Table 1. Word Recognition Accuracy (%). 

Fig. 1. Noisy and enhanced, time and frequency domain signals at 0 dB input SNR for the Gaussian
type of noise. 



reconstruct the time-domain signal. We focus on the short-time amplitude of the speech
signal, leaving the noisy phase unprocessed, based on the assumption that the human ear
does not perceive phase distortion. The phase is added back after the enhancement procedure
is applied, and the time-domain signal is subsequently reconstructed using the
overlap-and-add method. Figure 1 shows the spectrogram of clean speech waveforms, the
corresponding noisy versions corrupted by Gaussian noise, and the corresponding
enhanced versions at 0 dB input SNR. The figures demonstrate extensive noise reduction.
At lower SNRs, close inspection and parallel listening tests reveal perceptible distortion
in the form of musical noise and impulsively occurring components for the coloured
types of noise, which cannot be suppressed, a fact that we attribute to the violation of the
stationarity assumption.

4 Conclusions 

We wish to emphasize the practical utility of our approach, which does not seek to revise
already existing and successful front-ends that include an FFT stage in their construction
(e.g. MFCC, PLP). As our framework works on a frame-level basis on the spectral vectors
already extracted for the recognition purpose, the restoration process comes at little to no
extra computational overhead, thus achieving real-time, on-line performance. Extensive
experimentation with this technique gave excellent results in the case of white Gaussian
noise in terms of word recognition performance and, most importantly, the preservation of
natural sound. Further work focuses on incorporating a mixture of Gaussians into the
noise model, which also results in closed-form MAP solutions, as well as on better estimation
techniques for the noise variance that can be directly inserted in Eq. 9-Eq. 12.

Abstract. This paper concerns the influence of the filter shape and the benefit of the
Hertz-Bark transformation on the word error rate (WER) obtained in a telephone-based
speech recognition application working with the Perceptually-based Linear
Predictive (PLP) parameterization. Five different filter shapes (rectangular, narrow
and wide trapezium, triangular and the classical PLP filter shape [1]) were
compared, and the effect of a nonlinear frequency transformation between the Hertz
and a generalized Bark axis was explored. Experiments with 100 speakers and with
a vocabulary size of 475 words were performed. During all experiments only
the zero-gram language model was used, to better expose the influence of particular
variables on changes of the WER.



1 Introduction 

Designing an optimum front-end for automatic speech recognition is still a great effort
of many research teams all over the world. The present paper aims to contribute
to these investigations. The first goal of the work was to judge the influence of various
filter shapes used in the Perceptually-based Linear Predictive (PLP) parameterization
on the recognition accuracy. The second goal was to verify whether adjusting
the front-end to the nonlinear perception of frequencies by the human hearing organ (Bark
frequency warping) is the best solution.

All experiments were performed on a continuous speech database pronounced by
100 speakers over a telephone channel. Because the speakers called from various places in
the Czech Republic, the transfer conditions (e.g. noise, distortion) were generally slightly
different for each call. Only the zero-gram language model was used during the recognition
experiments, to better expose the behavior of the word error rate (WER) caused by an adjustment
of the front-end.

This paper is organized as follows: Section 2 describes the recognition engine and
experiment conditions used during our work. Section 3 provides information about
the PLP parameterization used. Individual filters tested in our experiments are described and
the influence of their shapes on recognition results is discussed in Section 4. Section 5
reports on the effect of the nonlinear frequency transformation. Conclusions
are given in Section 6.

* Support for this work was provided by the Ministry of Education of the Czech Republic, 
project No. MSM234200004, and by the Grant Agency of the Czech Republic, project No. 
102/96/K087. 
2 Recognition Engine and Experiment Conditions

The recognition experiments were performed with the recognition engine that is part
of a telephone dialogue system [3] built at the Department of Cybernetics, University
of West Bohemia, Pilsen. The recognition engine is based on a statistical approach.
It incorporates a front-end, an acoustic model, a language model and a decoding block
that searches for the best word sequence matching the incoming acoustic signal.
The basic speech unit of our system is a triphone. Each triphone is represented
by a three-state HMM with a continuous output probability density function assigned to
each state. At present we use a mixture of 8 multivariate Gaussians for each state. As
the number of Czech triphones is too large, phonetic decision trees were used to tie the
states of Czech triphones. The analogue telephone signal was digitized by a DIALOGIC
D/21D telephone interface board at an 8 kHz sample rate and converted to the 8-bit
mu-law format. The aim of the front-end processor is to convert continuous speech into
a sequence of feature vectors. The parameterization can use either Mel-Frequency
Cepstral Coefficients (MFCCs) or PLP coefficients. The pre-emphasized acoustic waveform
is segmented into 25 ms frames every 10 ms, and the parametric representation is
computed after Hamming windowing.
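The framing step just described can be sketched in Python as follows. This is only an illustration: the pre-emphasis coefficient 0.97 is an assumption (the paper does not state one), and the function names are hypothetical.

```python
import math

SAMPLE_RATE = 8000                       # Hz, as stated in the text
FRAME_LEN = SAMPLE_RATE * 25 // 1000     # 25 ms -> 200 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000   # 10 ms -> 80 samples
PREEMPH = 0.97                           # assumed value, not given in the paper


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal):
    """Pre-emphasize the signal and cut it into Hamming-weighted frames."""
    if not signal:
        return []
    emphasized = [signal[0]] + [
        signal[i] - PREEMPH * signal[i - 1] for i in range(1, len(signal))
    ]
    window = hamming(FRAME_LEN)
    out = []
    for start in range(0, len(emphasized) - FRAME_LEN + 1, FRAME_SHIFT):
        chunk = emphasized[start:start + FRAME_LEN]
        out.append([s * w for s, w in zip(chunk, window)])
    return out
```

With these settings one second of speech (8000 samples) yields 98 overlapping frames of 200 samples each.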

The decoder uses a cross-word context-dependent HMM state network, which is
generated by a Net generator. The input of the Net generator is a text grammar
written in an extended BNF that respects the VoiceXML description. The whole
net consists of one or more connected grammars. The decoder uses a Viterbi search
with efficient beam pruning.

Because a variety of noise sounds (e.g. loud breathing or telephone channel noise)
can appear in an utterance, a set of noise HMMs was introduced and trained in
order to capture them. The speech material for all experiments was taken
from the Czech telephone corpus collected at the Department of Cybernetics. The corpus
consists of read speech transmitted over a telephone channel. One hundred speakers were
asked to read 40 sentences. These sentences were selected from Czech newspapers so as
to contain the most frequent triphones of spoken Czech. The corpus was manually
annotated and phonetically transcribed. It was then randomly divided so that
100 sentences formed the test part and the remainder of the corpus formed the training
part. The vocabulary of our task contained 475 different words. Since several words
had multiple phonetic transcriptions, the final vocabulary consisted of 525 items
(the different phonetic forms of words) plus 3 additional non-speech events (NOISE,
LOAD_BREATH, and LIP_SMACK). In all recognition experiments a zero-gram language
model was applied, which means that every word of the vocabulary is equally probable
as a successor of any given word in the recognized utterance. The perplexity of the
task was thus 528.
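The perplexity figure follows directly from the uniform word distribution. A minimal Python check (the helper name is illustrative) computes the perplexity as 2 raised to the entropy of the next-word distribution:

```python
import math


def perplexity(probs):
    """Perplexity 2**H of a next-word probability distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


# 525 phonetic word forms plus 3 non-speech events, all equally probable
vocab_size = 525 + 3
uniform = [1.0 / vocab_size] * vocab_size
zero_gram_perplexity = perplexity(uniform)
```

For a uniform distribution the perplexity equals the number of items, which is why a zero-gram model over 528 items has perplexity 528.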



3 PLP Parameterization 



The front-end used in our experiments works with the PLP parameterization [2]. To
transform the power spectrum of speech into the corresponding auditory spectrum,
PLP combines three concepts from the psychophysics of hearing: the critical-band
spectral selectivity, the equal-loudness curve and the intensity-loudness power law.
We now briefly describe the steps and settings used in this parameterization.

a) Short-term speech spectrum. The speech signal is weighted by the Hamming window 
and the short-time power spectrum is computed from the short-time signal spectrum. 

b) Nonlinear frequency transformation and critical-band spectral resolution. In PLP
these phenomena are modeled both by the nonlinear transformation of frequencies from
the Hertz to the Bark scale (see equation (2) for k = 1) and by the construction of
masking curves that simulate the critical bands of hearing and are realized as
band-pass filters. Note that the filter shapes in the original source were designed
in the form shown in Figure 1, part E. The centers of the filters are spaced linearly
in the Bark domain with a step of approximately 1 Bark. As the speech signal covers
the range from 0 to 4 kHz (the sampling frequency was 8 kHz), the corresponding range
on the Bark scale was 0 to 15.57 Bark, and we used M = 17 filters spaced linearly with
a step of 0.973 Bark. The 0th filter was centered at 0 Bark and the last, (M - 1)th,
filter at 15.57 Bark.
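The numbers in step b) can be reproduced with a short Python sketch. It assumes the Bark warping commonly used in PLP analysis, Omega(f) = 6 asinh(f / 600) with f in Hertz; whether this is exactly the k = 1 case of equation (2) is an assumption, although it does give 15.57 Bark at 4 kHz. The function names are illustrative.

```python
import math


def hz_to_bark(f_hz):
    """Bark warping commonly used in PLP: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def bark_to_hz(z_bark):
    """Inverse of hz_to_bark."""
    return 600.0 * math.sinh(z_bark / 6.0)


M = 17                                   # number of filters, from the text
nyquist_bark = hz_to_bark(4000.0)        # upper band edge on the Bark axis
step = nyquist_bark / (M - 1)            # spacing of the filter centers
centers_bark = [i * step for i in range(M)]
centers_hz = [bark_to_hz(z) for z in centers_bark]
```

Here `nyquist_bark` comes out at about 15.575 and `step` at about 0.973, matching the values quoted above; `centers_hz` shows how the linear Bark spacing becomes progressively wider in Hertz.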

c) Critical-band adjustment to the equal-loudness curve. To adjust the power spectrum
to the unequal sensitivity of human hearing at different frequencies, we built a
function that approximates the equal-loudness curve. It is a transfer function with
asymptotes of +40 dB/decade from 0 to 400 Hz, 0 dB/decade from 400 to 1200 Hz,
+20 dB/decade from 1.2 to 3.1 kHz and 0 dB/decade from 3.1 to 4 kHz.

d) Weighted spectral summation of power spectrum samples. In this step the computation 
of the weighted power spectrum for the outputs of the individual critical band filters is 
performed. 

e) Enforcing the intensity-loudness power law. The outputs of the critical-band filters
were subjected to an operation that approximates the power law of hearing, which
describes the relation between the intensity of a sound and its perceived loudness.
Amplitude compression by the cube root was used in this case.

f) All-pole spectrum approximation. In the following step of the PLP analysis the cube
roots of the outputs of the critical-band filters were approximated by an all-pole
model. First, the values of the autocorrelation function R(i) were computed, and then
the PLP predictive coefficients a(i), i = 0, ..., Q, were obtained using Durbin's
algorithm. The optimal order of the predictor Q was estimated on the basis of extensive
experiments (the results of this work are being prepared for publication [4]) and was
set to Q = 7.

g) Transformation of the PLP coefficients to the PLP-cepstral representation. The
PLP-cepstral coefficients c(1), ..., c(Q) were computed by the standard approach from
the Q PLP predictive coefficients. For the final speech modeling we extended the
original PLP-cepstral representation with delta and delta-delta features. The dimension
of the feature space in which the acoustic models of triphones were built was
therefore 3Q = 21.
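The paper does not spell out the "standard approach"; a common choice is the recursion between predictor and cepstral coefficients sketched below. The sign convention (all-pole model G / (1 - sum_k a(k) z^-k)) and the function name are assumptions made for this illustration.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c(1..n_ceps) of the all-pole model
    G / (1 - sum_k a[k] z^-k); a[0] is unused, a[1..Q] are the predictors."""
    q = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= q else 0.0
        for k in range(1, n):
            if n - k <= q:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]


# Single-pole example: the cepstrum of 1 / (1 - 0.5 z^-1) is 0.5**n / n
example = lpc_to_cepstrum([0.0, 0.5], 3)
```

With Q = 7 this yields 7 static coefficients per frame; appending their delta and delta-delta features gives the 21-dimensional vector mentioned above.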



4 Influence of the Filter Shape on Recognition Results 



The goal of our first experiments was to explore how the shape of the critical-band
filters influences the resulting recognition accuracy. Five different filter shapes
(see Figure 1) were used in our experiments: rectangular (A), narrow trapezium (B),
wide trapezium (C), triangular (D) and the classical PLP filter shape (E). To evaluate
the recognition results we used the standard measure, the accuracy (Acc), defined as

$$\mathrm{Acc} = \frac{N - D - S - I}{N} \times 100\,\%, \qquad (1)$$

where N is the total number of events (words) in the reference transcription, S is the
number of substitution errors, D is the number of deletion errors and I is the number of
insertion errors. The results of recognition experiments are summarized in Table 1.

Fig. 1. Filter shapes used in the recognition experiments (A-rectangular, B-narrow
trapezium, C-wide trapezium, D-triangle, E-classical PLP shape).
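As a worked illustration of equation (1), the following Python sketch computes the accuracy from error counts. The counts in the example are hypothetical and are not taken from the experiments.

```python
def word_accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy in percent according to equation (1)."""
    if n_ref <= 0:
        raise ValueError("the reference transcription must contain words")
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref


# Hypothetical counts: 1000 reference words, 50 deletions,
# 100 substitutions and 20 insertions
acc = word_accuracy(1000, 50, 100, 20)
```

Because insertions are subtracted as well, the accuracy can become negative when the recognizer inserts many spurious words.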



Table 1. The results of recognition experiments with the five filter shapes

Filter          A      B      C      D      E
Accuracy [%]    82.17  82.94  81.84  83.33  83.01



It is evident that the differences in recognition accuracy among the individual filters
are relatively small; however, slightly better results were obtained not with the
classical PLP filter but with the simple triangular one.

5 Effect of the Nonlinear Frequency Transformation 

In this task we explored whether the nonlinear transformation of frequencies from the
Hertz to the Bark scale is the best adjustment for the speech recognition system.
We performed several experiments in which we warped the original frequency,
measured in Hertz, onto the so-called "generalized Bark" axis. The transformation
depends on the coefficient k and can be expressed (see [1]) by the relation




where f is the frequency measured in Hertz, the result is the frequency measured in
generalized Bark and k is the parameter. For k = 1 we obtain the standard Bark scale;
for k ≪ 1 the equation represents a nearly linear transformation of frequencies.

Figure 2 shows curves representing transformation from the Hertz scale into the 
generalized Bark scale for different values of k and Figure 3 depicts corresponding filter 
coverage (we used the standard PLP filter shape, see Figure 1, part E). The results of 
experiments are given in Table 2 and depicted in Figure 4. 



Table 2. The results of recognition experiments for individual values of k

k               0.01   0.1    0.2    0.6    1      2      5      10     100
Accuracy [%]    80.61  80.48  81.13  81.19  83.01  81.97  79.77  75.49  67.77
Fig. 2. Transformation curves between the Hertz and the generalized Bark scale.

Fig. 3. The filter coverage for the parameters k = 0.1, 1 and 10.

Fig. 4. Dependence of the recognition accuracy on the parameter k.
6 Conclusion

The results of speaker-independent continuous speech recognition confirmed that the
PLP technique, which combines several engineering approximations in order to adjust
the characteristics of the front-end to human hearing, provides the best solution for
the nonlinear frequency warping. The optimal adjustment was obtained for the parameter
k = 1, which corresponds to the standard Bark scale. The shapes of the filters used in
the experiments did not influence the recognition accuracy very much; however, the
triangular shape of the critical-band filter gave the best results. It should be
mentioned once more that all experiments were performed with 21 coefficients of the
PLP-cepstral parameterization (7 PLP-cepstral + 7 delta + 7 delta-delta); a lower
number of basic PLP-cepstral coefficients (with their delta and delta-delta features)
gave worse results.
Abstract. A system for full-duplex speech communication over the Internet is
described. Frame-independent codecs at rates of 4800, 9600 and 19200 bps were
used for speech compression. A special protocol based on the User Datagram
Protocol (UDP) was used for data transmission. The paper consists of three parts.
In the first part we briefly describe our codecs. In the second part we describe
our protocol for organizing a full-duplex speech link. In the third part we present
our experimental system for speech communication, which uses the codecs and the
protocol from the first and second parts. Our developments were used in the
"etalkRadio" project, which organizes live shows on the Internet. A detailed
description of this project will be presented at http://www.etalkradio.net.



1 Introduction 

Developers usually use the Transmission Control Protocol (TCP) to transmit data in
applications for speech communication over the Internet. This protocol guarantees
delivery of all transmitted data and works well over a reliable channel. If the
communication channel is poor, some data can be lost during transmission, and to
compensate for the loss TCP sends the data again. As a result the delay between sender
and receiver grows and the quality of the dialog degrades quickly: each talker has to
wait several seconds or more for an answer. UDP eliminates this defect. It allows data
to be lost during transmission, but makes it possible to organize speech communication
with a fixed delay.

Applications for speech communication that work on the basis of TCP usually use
Code Excited Linear Predictive (CELP) codecs [1,2,3]. These codecs provide
acceptable quality of synthesized speech at medium rates (4800-9600 bps) and need
relatively little computation (5-10 million instructions per second (MIPS)).
Unfortunately, using CELP codecs in applications based on UDP is difficult. To
synthesize the current speech frame, the CELP decoder in the receiver needs information
about the excitation signal of the previous frame. If the previous frame was lost
during transmission, the excitation signal is taken from the last synthesized frame
instead. In this case the synthesis of the current and all subsequent frames would
be wrong, so the loss of a single packet leads to the end of the dialog. The codecs
used in the proposed system do not have this defect. Every frame in each codec is
encoded independently, so the loss of any frame or group of frames does not cause an
error. A pause is played in place of all lost frames, and speech synthesis continues
automatically once subsequent frames are received.
The first part describes the codecs. The second part presents the algorithm for
establishing a connection and transmitting speech. The third part presents the system
for speech communication over the Internet, built on the codecs and the connection
algorithm from the first and second parts.

2 Codecs Description 

The speech signal (16 bit, mono, 8000 Hz) passes through a band-pass filter consisting
of 3rd-order Butterworth low-pass and high-pass filters. The filtered signal is divided
into a sequence of windows. Each window has 1440 samples (180 ms). For transmission,
120 samples of the previous window and 1320 samples of the current window are encoded.

The procedure for encoding a window consists of three steps. In the first step the
window is divided into several blocks of equal length. For each block the linear
predictive coefficients (LPC) are computed, quantized and interpolated, and the
coefficients of the perceptual weighting filter are computed. In the second step the
initial parameters of the adaptive code book (ACB) are computed. These parameters at
the same time form the optimal excitation signal for a part of the first block. In the
third step the parameters of the excitation signal for the rest of the encoded window
are computed. The search for the optimal excitation parameters in the second and third
steps is carried out by an "analysis by synthesis" algorithm. The transmitted
parameters are the LPC, previously converted to line spectral frequencies (LSF), the
initial parameters of the ACB, and the parameters needed to synthesize the excitation
signal for the rest of the encoded window. At the receiver side the synthesis filter is
built from the LPC. The excitation signal is then reconstructed and passed through the
synthesis filter. The synthesized speech signal is taken from the output of the LPC
filter.

The LPC encoding procedure is identical for all codecs. The recorded window is divided
into 6 equal blocks of 240 samples. For each block the following sequence of operations
is executed: Hamming weighting, computation of the LPC by the Durbin algorithm, LPC
quantization and interpolation, and computation of the perceptual weighting filter
coefficients. The LPC quantization and interpolation are performed in the LSF domain.
Encoding the LPC of one block requires 34 bits, and the whole window correspondingly
requires 6 x 34 = 204 bits.
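For readers unfamiliar with the Durbin algorithm named above, a textbook Levinson-Durbin recursion is sketched below in Python. It is a generic illustration under the convention x[n] ≈ sum_k a(k) x[n-k], not the authors' implementation, and the predictor order is left as a parameter because the paper does not state it.

```python
def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order]
    from autocorrelation values r[0..order]; returns (coefficients, error)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("autocorrelation sequence is not positive definite")
        # reflection coefficient of stage i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Autocorrelation of a first-order process with pole 0.5: r[i] = 0.5**i
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For this example the recursion recovers the single predictor coefficient 0.5, sets the second coefficient to zero and leaves a residual energy of 0.75.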

The initial parameters of the ACB are computed for the first 180 samples of the encoded
window. The ACB is initialized with an impulse sequence, which at the same time is the
optimal excitation signal for this part of the window. Each impulse in the sequence is
characterized by its amplitude and position. The detailed algorithm for searching the
impulse parameters is given in [4]; here we give only a brief description.

The parameters of each impulse are searched in separate parts of equal length into
which the 180 samples are divided. For the codecs at 4800 and 9600 bps the 180 samples
are divided into 23 parts (22 parts of 8 samples and a 23rd part of 4 samples); for the
codec at 19200 bps, into 45 parts of 4 samples. Only one impulse is allowed in each
part. The number of bits for encoding the first 180 samples is given for each codec in
Table 1.

The search for the excitation parameters of the remaining part of the encoded window
(1440 - 180 = 1260 samples) is performed in the third step and has its own features for
each codec.

Table 1. The number of bits for encoding the first 180 samples of the excitation signal

Codec    Amplitude, bit    Position, bit    Number of impulses    Total, bit
4800     6                 3                23                    207
9600     7                 3                23                    230
19200    7                 2                45                    405



For the codec at 4800 bps the 1260 samples are divided into 21 parts of 60 samples.
Four parameters are computed for each part by the "analysis by synthesis" algorithm:
delay and gain for the ACB, and gain and index for the algebraic code book (AlgCB) [5].
The parameters for the AlgCB are computed only for even parts (0,2,4,...,20). The bits
for encoding the residual part of the window are:

- delay for ACB: 7 bits;
- gain for ACB: 7 bits;
- index for AlgCB: 9 bits;
- gain for AlgCB: 6 bits;
- encoding of 11 odd parts: 14 x 11 = 154 bits;
- encoding of 10 even parts: 29 x 10 = 290 bits;
- in total: 154 + 290 = 444 bits.

For the codec at 9600 bps the residual part of the window is divided into 42 parts of
30 samples. The excitation signal for even parts is encoded by the delay and gain for
the ACB. Three extra impulses are added in every odd part. The first impulse can be
placed only within the first 10 samples, the second within the second 10 samples and
the third within the third 10 samples. The bits for encoding the residual part of the
window are:

- delay for ACB: 7 bits;
- gain for ACB: 7 bits;
- impulse position: 4 bits;
- impulse amplitude: 7 bits;
- encoding of 21 odd parts: 47 x 21 = 987 bits;
- encoding of 21 even parts: 14 x 21 = 294 bits;
- in total: 987 + 294 = 1281 bits.

For the codec at 19200 bps the residual part of the window is divided into 84 parts of
15 samples. The excitation signal for each part is encoded by the following parameters:
delay and gain for the ACB and two impulses. For the last (84th) part only the ACB
parameters are encoded. The bits for encoding the residual part of the window are:

- delay for ACB: 7 bits;
- gain for ACB: 7 bits;
- impulse position: 3 bits;
- impulse amplitude: 7 bits;
- encoding of 83 parts: 34 x 83 = 2822 bits;
- encoding of the last (84th) part: 14 bits;
- in total: 2822 + 14 = 2836 bits.

The number of bits for encoding the whole window is given for each codec in Table 2.



Table 2. Information about codecs

Codec    LPC,    180 samples,    1260 samples,    Total,    Real transmit    Compression
         bit     bit             bit              bit       rate, bps        ratio
4800     204     207             444              855       4750             26.9
9600     204     230             1281             1715      9528             13.4
19200    204     405             2836             3445      19139            6.7
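The entries of Table 2 can be re-derived from the per-part bit counts listed above. The following Python sketch does this arithmetic; the dictionary layout and the names are illustrative, and the compression ratio is taken relative to the 16-bit, 8000 Hz input stated in the text.

```python
WINDOW_SAMPLES = 1440        # samples per encoded window (180 ms)
SAMPLE_RATE = 8000           # Hz
PCM_BITS = 16                # bits per input sample
LPC_BITS = 6 * 34            # six blocks of 34 bits each

# Bits for the first 180 samples ("head") and the remaining 1260 ("tail")
CODECS = {
    4800: {"head": (6 + 3) * 23, "tail": 14 * 11 + 29 * 10},
    9600: {"head": (7 + 3) * 23, "tail": 47 * 21 + 14 * 21},
    19200: {"head": (7 + 2) * 45, "tail": 34 * 83 + 14},
}


def budget(nominal_rate):
    """Return (bits per window, real bit rate in bps, compression ratio)."""
    parts = CODECS[nominal_rate]
    total_bits = LPC_BITS + parts["head"] + parts["tail"]
    real_rate = total_bits * SAMPLE_RATE / WINDOW_SAMPLES
    ratio = SAMPLE_RATE * PCM_BITS / real_rate
    return total_bits, real_rate, ratio
```

For example, `budget(9600)` gives 1715 bits per window, which corresponds to about 9528 bps and a compression ratio of about 13.4.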



3 The Protocol for Full-Duplex Speech Communications 

Applications for speech communication over the Internet usually do not give users
information about the state of the remote computer. Once a connection has been
established, it is often very useful to know the channel quality. This information can
help the user decide whether to continue, or to break the link and try again if it is
very bad.

To estimate the quality of the communication channel it is necessary to know the
average network delay T_n and the number of packets P_l lost during transmission. To
estimate these parameters we send several packets to the remote computer and wait for
their return. Then, knowing the sending time t_1 and the receiving time t_2, and the
numbers of sent (P_s) and returned (P_r) packets, we can compute these parameters:

$$T_n = (t_2 - t_1)/2, \qquad P_l = P_s - P_r.$$
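A minimal Python sketch of this estimate is given below. Averaging the one-way delay over all returned probe packets is an assumption of the sketch; the function name and the timestamps in the example are hypothetical.

```python
def channel_quality(round_trips, sent_count):
    """Estimate (average one-way delay, number of lost packets).

    round_trips holds one (t_send, t_receive) pair in milliseconds for
    every probe packet that came back; sent_count is the number sent.
    """
    returned = len(round_trips)
    lost = sent_count - returned
    if returned == 0:
        return None, lost
    delay = sum((t2 - t1) / 2 for t1, t2 in round_trips) / returned
    return delay, lost


# Hypothetical example: three probes sent, two returned
estimate = channel_quality([(0, 200), (100, 340)], 3)
```

In the example the two round trips of 200 ms and 240 ms give an average one-way delay of 110 ms with one packet lost.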

To get a more accurate estimate of the channel quality it would be necessary to return
the received sound packets, but in that case the transmission rate doubles. To minimize
the rate we therefore decided to transmit sound packets in one direction only, and to
use special small service packets for estimating the channel quality and exchanging
control information.

The service packets are sent in groups. Each group consists of M packets. The packets
within a group are sent with period T_1. The packet groups are sent with period
T_2 > M x T_1, or whenever the current state changes. All packets in a group carry the
same data: number, state and user identifier. The number makes it possible to ignore
late packets. The state tells the remote user what to do and what the opposite side is
doing at the current moment. The user identifier makes it possible to classify packets
as "mine/not mine". We used a similar approach in [4].

A very important task during speech communication is to maintain a delay that is as
short as possible and fixed for the whole duration of the dialog. This delay (the time
after which the remote side hears the sound) consists of the following components:

- ΔT_1: time for recording a window, ms
- ΔT_2: time for encoding a window, ms
- ΔT_3: time to transmit a window through the net, ms
- ΔT_4: time to build the play sound queue, ms
- ΔT_5: time for decoding the packets in the play sound queue, ms

In the present system the following values were chosen:

- ΔT_1 = 180 ms;
- ΔT_2: depends on the processor (for real-time applications it must be smaller than
ΔT_1);
- ΔT_3: according to our experiments the most probable range is [100, 900] ms;
- ΔT_4: to compensate for the network delay ΔT_3 this value must be max(ΔT_3) = 900 ms;
- ΔT_5: depends on the processor.

On computers with Pentium II processors the values ΔT_2 and ΔT_5 can be neglected.
So if ΔT_3 is at its minimum, the total one-way delay will be 1-1.3 seconds. Under good
communication conditions the delay can be reduced further by decreasing ΔT_4. On
computers running a large number of demanding tasks, a considerable reduction of ΔT_4
can cause interruptions of the sound during playback.

The choice of such a large analysis window (180 ms) has two reasons. First, it makes
frame-independent codecs possible. Second, it increases the ratio of useful
information to service information in each transmitted packet.

The process of establishing a speech link consists of the following steps:

1. Service packets with the CONNECTING state are sent from computer1 to computer2.

2. After receiving these packets, computer2 goes to the CONNECTED state and returns
the packets to computer1.

3. When the packets return to computer1, it also goes to the CONNECTED state.

4. Now the computers are connected and both can go to the TALKING state.
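The four steps above can be sketched as a small state machine in Python. The class, the method names and the packet layout (a dictionary with "user" and "state" fields) are hypothetical; only the state names come from the text, and real service packets would be exchanged over UDP rather than passed as objects.

```python
from enum import Enum


class State(Enum):
    DISCONNECTED = 0
    CONNECTING = 1
    CONNECTED = 2
    TALKING = 3


class Peer:
    def __init__(self, user_id):
        self.user_id = user_id
        self.state = State.DISCONNECTED

    def connect(self):
        """Step 1: start connecting and build a service packet."""
        self.state = State.CONNECTING
        return {"user": self.user_id, "state": State.CONNECTING}

    def on_service_packet(self, packet):
        """Steps 2 and 3: handle a received service packet."""
        foreign = packet["user"] != self.user_id
        if foreign and packet["state"] is State.CONNECTING:
            self.state = State.CONNECTED
            return packet                  # callee echoes the packet back
        if not foreign and self.state is State.CONNECTING:
            self.state = State.CONNECTED   # caller sees its own packet return
        return None

    def talk(self):
        """Step 4: only a connected peer may start talking."""
        if self.state is not State.CONNECTED:
            raise RuntimeError("not connected")
        self.state = State.TALKING
```

The user identifier carried in each packet is what lets the caller recognize its own returned packet, in the spirit of the "mine/not mine" test described earlier.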

The algorithm for maintaining the speech link consists of the following steps:

1. Wait for the first sound packet with any number.

2. Receive the first sound packet with any number.

3. Wait ΔT_4 ms:

- ΔT_4 = (N - 1) x ΔT_1 = 4 x 180 = 720
- N: number of packets needed to build the play queue (N = 5)
- We expect to receive the next N - 1 packets within this time.

4. Because the first packet can be late, more than N packets may be received during
ΔT_4. To maintain the fixed delay ΔT, the startpacket number is computed as:

- startpacket = N_last - N + 1
- waitpacket = startpacket
- playpacket = startpacket
- N_last: number of the last received packet in the list;
- N: number of packets needed to build the play queue;
- waitpacket: number of the packet being waited for;
- playpacket: number of the last packet sent to playback.

IF list is empty THEN
- startpacket = 1
- waitpacket = startpacket
- playpacket = startpacket
Remember the current time as the start time.

5. Decode N packets from the list and send them to playback. Playback begins after
M packets have been sent (M < N).

Cycle from 1 to N:
- Request the packet with number waitpacket from the list.
- IF the packet is present THEN
    • take it from the list;
    • delete all packets in the list with numbers < waitpacket;
    • decode the packet.
  ELSE
    • insert a pause in place of the lost packet.
- Send the packet to playback.
- waitpacket = waitpacket + 1
- playpacket = playpacket + 1

6. Enter the main cycle.

MAIN CYCLE:
- IF state is DISCONNECTED THEN goto 1.
- Request the packet with number waitpacket from the list.
- IF the packet is present THEN
    • take it from the list;
    • delete all packets in the list with numbers < waitpacket;
    • decode the packet;
    • send the packet to playback;
    • playpacket = waitpacket;
    • compute the number of the next waitpacket.
  ELSE
    • IF (list is empty) and (remote state is CONNECTED) THEN goto 1.
    • compute the playingpacket number;
    • IF playingpacket < waitpacket THEN
        * wait 1/5 of a packet's duration;
        * goto MAIN CYCLE.
      ELSE
        * waitpacket = playingpacket + 1;
        * request from the list a packet with number > playpacket but < waitpacket;
        * IF the packet is present THEN
            - get its number and save it in playpacket;
            - delete packets with smaller numbers from the list;
            - decode the packet;
            - send it to playback;
            - update waitpacket;
            - goto MAIN CYCLE.
Additional notes on the algorithm:

1. Packets enter the list from the receiver.

2. The packets in the list are sorted by number.

3. A packet number is computed from the timer.

Under bad conditions this algorithm allows sound to be played with an arbitrarily long
delay, but when the quality of the link becomes good the algorithm returns to playing
sound at the fixed short delay.
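Steps 4 and 5 of the algorithm (choosing the first packet to play and filling the play queue, with pauses substituted for lost packets) can be illustrated with the simplified Python sketch below. It covers only these two steps, represents the packet list as a dictionary keyed by packet number, and marks a pause with None; all of these choices are assumptions of the sketch, not the authors' implementation.

```python
N_QUEUE = 5   # N, the number of packets needed to build the play queue


def start_packet(received_numbers, n=N_QUEUE):
    """Step 4: number of the first packet to play."""
    if not received_numbers:
        return 1
    return max(received_numbers) - n + 1


def fill_queue(packet_list, start, n=N_QUEUE):
    """Step 5: collect n consecutive packets; None stands for a pause."""
    queue = []
    for number in range(start, start + n):
        # packets older than the one being waited for are discarded
        for stale in [k for k in packet_list if k < number]:
            del packet_list[stale]
        queue.append(packet_list.pop(number, None))
    return queue


# Hypothetical situation: packet 3 arrived too late, packet 5 was lost
received = {3: "p3", 4: "p4", 6: "p6", 7: "p7", 8: "p8"}
first = start_packet(received)
play_queue = fill_queue(received, first)
```

In this example playback starts at packet 4, the stale packet 3 is dropped and a pause replaces the missing packet 5, so the queue still spans five packet durations and the delay stays fixed.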

4 System for Speech Communication 

On the basis of the described codecs and the protocol for speech communication, a
simple experimental system was developed. The system makes it possible to set up a
point-to-point speech link over the Internet or a local network. The system interface
is shown in Figure 1. The desired codec is selected when the system starts. Before
connecting, the user must enter the IP address and port of the remote computer in the
fields "Remote IP address" and "Remote port", and then press the "Connect" button to
establish the connection.





Fig. 1. The interface of experimental system for speech communication.



After a successful connection the user goes to the "Connected" state. In the fields
"Delivered packets, %" and "Deliver delay, ms" the user can see, respectively, the
percentage of returned service packets and the average one-way packet transmission
time. The information in these fields is updated every 6 seconds. In the "Connected"
state the buttons "Talk", "Pause" and "Disconnect" become active. Pressing the "Talk"
button starts recording, encoding and sending sound to the remote computer. To pause,
the user presses the "Pause" button; to end the dialog, the "Disconnect" button. The
fields "Remote state" and "Local state" show the current state of the remote and local
computers.
5 Conclusion

The developments presented in this paper were used in the "etalkRadio" project, which
organizes live shows over the Internet. More information is available at
http://www.etalkradio.net.
Abstract. This paper presents our experiments with an automatic recognition
system trained on the Slovak SpeechDat-E* database. The system is based on the
reference recognition system developed by the COST 249 SpeechDat special interest
group, an automatic, language-independent procedure for training and testing
phonetic recognition systems on SpeechDat-compliant databases. Our test results
for the Slovak SpeechDat(E) database are presented along with some results
reported by the COST 249 community.



1 Introduction 

In our early recognition experiments [1] we used the TIMIT database for recogniser
training. As the TIMIT database is in English, these experiments had no practical
value for our speech environment. We tried to create our own speech database for
Slovak, but our resources and experience with such a task were limited.

As a result of our efforts, the DaTUKE database was created [2]. This database
consists of isolated digits and connected digit strings spoken by 50 male and female
speakers. Our first recognition experiments in Slovak were presented in [3].
However, this database was not compatible with any other speech database, so it was
not possible to compare our results with recognition results for other languages.

As a consequence of the continually growing number of SpeechDat-compatible databases,
the SpeechDat databases and standards were chosen as a source of multilingual speech
data and as a standard for speech files, label files, annotation conventions, lexicon,
etc. SpeechDat-E is one of a series of European projects aiming at the creation of
large telephone speech databases [4]. The aim of this project is to create
SpeechDat-compliant databases for five Eastern European languages, one of which is
Slovak. The results of a recognition system trained on the Slovak SpeechDat database
can now be compared with results for other languages.

The fully automatic procedure represented by the reference recogniser creates a set
of acoustic phoneme models directly from the SpeechDat database CD-ROMs. It is based
on a boot-strapping procedure that works without pre-segmented data. The procedure
uses the HTK toolkit and accounts for differences between languages and imperfections
in database creation. Test scripts for several typical applications are also included,
along with a web site for the exchange of software and results.

* Our use of the SpeechDat database for academic research is permitted on the basis of
co-operation between the Department of Electronic and Multimedia Telecommunication,
FEI TU Kosice, and the collector of the database, the Institute of Control Theory and
Robotics, Department of Speech Analysis and Synthesis, Slovak Academy of Sciences,
Bratislava.



2 The SpeechDat-E Database for Slovak 

The primary task within the SpeechDat project is the provision of spoken language 
resources recorded over telephone lines. SpeechDat databases are intended to be used for 
developing a number of applications such as information services, transaction services 
and other call processing services. 

This database includes telephone recordings of 1000 speakers recorded directly over
the fixed PSTN using a digital ISDN interface. Speech files are stored as sequences
of 8-bit, 8 kHz A-law speech samples. Each prompted utterance is stored in a separate
file and each speech file has an accompanying ASCII label file. The transcription used
in this database includes a few details that represent audible acoustic events (speech
and non-speech, such as speaker, stationary or intermittent noise) present in the
corresponding waveform files. The database covers the main dialect regions of Slovakia.

The SpeechDat-E corpus content is roughly following: 

- isolated digits, 

- connected digits, 

- dates/times, 

- money amounts, 

- natural numbers, 

- directory assistance names, 

- phonetically rich sentences, 

- spelled names/words, 

- yes/no questions. 

The training set consists of 800 sessions and the remaining 200 sessions form the test set.



3 Training Procedure 

To allow the achieved results to be compared with those of other researchers, the whole
process of training the recognition system is driven by a set of Perl scripts called refrec
(reference recogniser). The reference recogniser has been developed by the COST 249
SpeechDat special interest group [5]. It is an automatic, language-independent procedure
for training and testing phonetic recognisers on SpeechDat-compliant databases. The
training procedure uses tools from the HTK toolkit.

The first step in the training procedure is training data preparation. This stage includes
the selection of training sessions, the removal of utterances containing intermittent noise,
truncated recordings, mispronunciations, unintelligible speech or phonetic letter
pronunciations, and the generation of feature files. The following files are imported from
the SpeechDat-E database: A-law speech files, SAM format label files, the lexicon and a list
of training sessions. The A-law speech samples have to be converted to 16-bit linear speech
samples before they can be processed with the HTK toolkit.
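
The A-law to linear conversion mentioned above can be sketched as follows. This is a minimal illustration of standard G.711 A-law expansion, not the code used by the reference recogniser; the function names are hypothetical.

```python
import struct


def alaw_to_linear(byte):
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    a = byte ^ 0x55                  # undo the even-bit inversion of A-law
    t = (a & 0x0F) << 4              # 4-bit mantissa
    seg = (a & 0x70) >> 4            # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    return t if a & 0x80 else -t     # sign bit set means positive


def alaw_bytes_to_pcm16(data):
    """Convert raw A-law bytes to little-endian 16-bit PCM bytes."""
    samples = [alaw_to_linear(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

For example, `alaw_bytes_to_pcm16(raw)` applied to the contents of one speech file (here a hypothetical `raw` bytes object) returns twice as many bytes, one 16-bit sample per input byte.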

Mel frequency cepstral coefficients (MFCC) are used as acoustic features for the
parameterisation of speech. The feature vector consists of 13 cepstral coefficients,
including the zeroth cepstral coefficient as energy, together with their first and second
order deltas.
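
As an illustration of how 13 static coefficients grow into a 39-component vector, the sketch below appends first and second order deltas computed with the usual regression formula. The window size of 2 frames and the replication of edge frames are assumptions, and the function names are hypothetical; the actual settings of the reference recogniser are not given here.

```python
def deltas(frames, window=2):
    """Regression deltas over +/- `window` frames; edge frames are replicated."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for th in range(1, window + 1):
                nxt = frames[min(t + th, n - 1)][d]
                prv = frames[max(t - th, 0)][d]
                num += th * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def add_deltas(static):
    """Append first and second order deltas to each static feature vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static coefficients per frame, `add_deltas` yields 13 + 13 + 13 = 39 values per frame.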

The final part of data preparation is lexical processing. The lexicon is imported from
the SpeechDat database and converted to an HTK format dictionary; optional stress
information is removed, phone mapping is performed, and silence and tee models are added.
Finally, both word and phoneme level MLF files are created and a list of the phonemes used
in the dictionary is generated.

Phonemes are modelled with three-state left-to-right HMMs with no skip transitions.
Diagonal covariance Gaussians are used. Training starts with the generation of a flat-start
prototype. This prototype is initially boot-strapped to the global mean and variance of the
whole boot-strap training subcorpus. Only phonetically balanced sentences are used for this
boot-strap training. The prototype initialised in this way is cloned into the phone models.
The monophone models are then re-estimated by the Baum-Welch re-estimation procedure. Next,
tied silence and short pause (between-word silence) models are added to take care of both
background noise and silence. The phone-level transcription is improved by realigning the
training data, and outlier sentences that fail in forced alignment are removed from the
training set.
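
The flat-start idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the model layout (a non-emitting entry and exit state around three emitting states) follows the usual HTK convention, the self-loop probability of 0.6 is an arbitrary starting value, and all names are hypothetical. In HTK itself this step is typically carried out with the HCompV tool.

```python
def global_mean_and_variance(frames):
    """Per-dimension mean and variance over all boot-strap training frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def left_to_right_transitions(n_emitting=3, stay=0.6):
    """Transition matrix with self-loops and next-state moves only (no skips)."""
    size = n_emitting + 2            # plus non-emitting entry and exit states
    trans = [[0.0] * size for _ in range(size)]
    trans[0][1] = 1.0
    for s in range(1, n_emitting + 1):
        trans[s][s] = stay
        trans[s][s + 1] = 1.0 - stay
    return trans


def flat_start(frames, phones, n_emitting=3):
    """Clone one prototype (global mean and variance) into every phone model."""
    mean, var = global_mean_and_variance(frames)
    models = {}
    for phone in phones:
        models[phone] = {
            "means": [list(mean) for _ in range(n_emitting)],
            "variances": [list(var) for _ in range(n_emitting)],
            "transitions": left_to_right_transitions(n_emitting),
        }
    return models
```

All phone models start out identical; it is the subsequent Baum-Welch re-estimation with the phone-level transcriptions that makes them diverge.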

The initial monophone models are successively split and re-estimated into 2, 4, 8, 
16, 32 Gaussian mixture components. The 32-mixture monophones are used to segment 
the training set in another forced alignment. The obtained phoneme segment boundaries 
are then used to create entirely new monophones in an isolated word training style. 
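
The doubling schedule 1, 2, 4, 8, 16, 32 can be illustrated with a simple mixture splitting routine. The sketch assumes the scheme commonly described for HTK's HHEd "MU" command (split the heaviest component, halve its weight, shift the two copies by plus and minus 0.2 standard deviations); the re-estimation passes between splits are omitted, and the function names are hypothetical.

```python
import math


def split_heaviest(components, perturb=0.2):
    """Replace the heaviest Gaussian by two copies with shifted means."""
    idx = max(range(len(components)), key=lambda i: components[i]["weight"])
    c = components.pop(idx)
    std = [math.sqrt(v) for v in c["var"]]
    half = c["weight"] / 2.0
    for sign in (1.0, -1.0):
        components.append({
            "weight": half,
            "mean": [m + sign * perturb * s for m, s in zip(c["mean"], std)],
            "var": list(c["var"]),
        })


def mix_up(components, target):
    """Split components until the state has `target` mixture components."""
    comps = [dict(c) for c in components]
    while len(comps) < target:
        split_heaviest(comps)
    return comps
```

A state would be grown with `for target in (2, 4, 8, 16, 32): state = mix_up(state, target)`, with Baum-Welch re-estimation after each step.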

Word-internal context-dependent models are built from the initialised single-mixture
monophone models for all triphones occurring in the training set. The monophone
transcriptions are first expanded into triphones. For isolated word and connected word
recognition, word-internal triphone transcriptions are used. For the continuous recognition
task, cross-word triphones are used. The monophone models are first cloned to triphone
models and then re-estimated with context-dependent supervision. In order to reduce the
total number of HMM states, state-clustered triphones are created and decision tree state
tying is performed.
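
The difference between word-internal and cross-word expansion can be shown on a toy transcription. The sketch uses the HTK-style "left-centre+right" naming; the phone sequences and the use of "sil" as the utterance boundary context are illustrative assumptions.

```python
def word_internal_triphones(word_phones):
    """Expand one word; contexts never cross the word boundary."""
    n = len(word_phones)
    out = []
    for i, p in enumerate(word_phones):
        name = p
        if i > 0:
            name = word_phones[i - 1] + "-" + name
        if i < n - 1:
            name = name + "+" + word_phones[i + 1]
        out.append(name)
    return out


def cross_word_triphones(words, boundary="sil"):
    """Expand a whole utterance; contexts extend across word boundaries."""
    flat = [p for w in words for p in w]
    out = []
    for i, p in enumerate(flat):
        left = flat[i - 1] if i > 0 else boundary
        right = flat[i + 1] if i < len(flat) - 1 else boundary
        out.append(left + "-" + p + "+" + right)
    return out
```

With word-internal expansion the first and last phone of each word become biphones, whereas cross-word expansion yields a full triphone for every phone.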

Finally, both the monophone and the tied-state triphone models are once again improved
by Gaussian mixture splitting and re-estimation, up to 32 mixture components.

4 Testing Procedure 

To ensure comparability of results for various languages, a commonly defined test suite
based only on the SpeechDat database itself is used. The official SpeechDat test sessions
are therefore used, with different subcorpora representing typical test applications.
The following tests have been carried out:

- I-test: Isolated digits recognition. The recogniser vocabulary is manually designed, 
with digit words (0-9) and synonyms. There is semantic mapping into the ten digit 
categories before scoring, so word error rates are the same as digit error rates. 

- Q-test: Yes/no recognition. This is an isolated word recognition test similar to I-test 
and there are only two semantic categories. 





M. Marcinek, J. Juhar, and A. Cizmar 



- A-test: Application words. The vocabulary contains 30 words, most of the typical 
command words for list processing applications. The A-words were translated from 
English. This means that sometimes, small phrases are used to represent the semantic 
equivalent of an English command word. However, each such phrase is considered 
a word, so word error rate (WER) is in this case the same as sentence error rate 
(SER). 

- BC-test: Digit strings. The B utterances are supposed to contain a short pause between
each digit, while the C utterances are recorded in a more natural context. Since this is
a continuous recognition task, a looped grammar with a fixed word insertion penalty
is used.

- O-test: City names. The vocabulary is generated automatically, so that no
out-of-vocabulary utterances occur. Beam search pruning is applied during recognition.

- W-test: Phonetically rich words. 

For all these tests common test procedures are used. Noise markers are ignored and ut- 
terances with OOV, mispronunciation, unintelligible speech or truncations are excluded 
in all procedures. 
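
The scoring used in the tests above can be sketched as a word error rate computed from the edit distance between reference and recognised word sequences, with an optional semantic mapping applied before scoring (as described for the I-test). This is an illustrative sketch, not the scoring script of the reference recogniser; the mapping entries in the usage note are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def word_error_rate(pairs, mapping=None):
    """WER in percent over (reference, hypothesis) pairs of word lists."""
    mapping = mapping or {}
    errors = 0
    words = 0
    for ref, hyp in pairs:
        ref = [mapping.get(w, w) for w in ref]
        hyp = [mapping.get(w, w) for w in hyp]
        errors += edit_distance(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words
```

With a mapping such as `{"zero": "0", "oh": "0"}`, recognising one synonym in place of another is not counted as an error, which is why word and digit error rates coincide in the I-test.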

5 Results 

In this paper we present results obtained with reference recogniser version 0.95 on the
Slovak SpeechDat(E) database. Results are given as word error rates (WER); for each test,
the best WER achieved was taken. Our test results are summarised in Table 1 together with
results reported by the COST 249 community.

A relatively large number of models is generated during the various stages of the training
process; in total, 42 models are generated. Figure 1 shows the typical progress of
recognition results for the individual models on the digit strings test (BC-test). The
following types of models were generated during training:

- first pass monophone models, denoted as mini_X_Y (X is the number of mixture
components and Y the number of training iterations),

- second pass monophone models, denoted as mono_X_Y,

- triphone models, denoted as tri_X_Y,

- and finally tied-state triphone models, denoted as tied_X_Y.

In our experiments word-internal context dependent models were used. 

6 Conclusion 

The aim of the recognition experiments with the Slovak SpeechDat database presented
in this paper is to extend the results reported by the COST 249 community with a further
language. The presented results confirm that the Slovak SpeechDat database is comparable
with other SpeechDat databases. Our current experiments are based on phoneme modelling.
In future, we would also like to examine the modelling of other speech units (syllables).
The reference recogniser version 0.95 used here does not take labelled non-speech acoustic
events into account during training, but version 0.96, which is already available, takes
these events into account and provides robustness against them during training.

Table 1. Recognition results as word error rates (in %) for SpeechDat-compliant databases,
reference recogniser version 0.95. In the A-test column the value in brackets is the
subcorpora used; in the O-test and W-test columns it is the number of vocabulary words.

Test language/  Isolated   Yes/no     Application    Digit       City names     Phon. rich
database        digits                words          strings                    words
                (I-test)   (Q-test)   (A-test)       (BC-test)   (O-test)       (W-test)

Danish          1.04       1.14       2.36 (A1-3)    2.30        15.82 (495)    64.38 (16934)
English         1.69       0.78       2.62 (A1-3)    3.93        12.64 (745)    38.74 (2528)
German          0.80       0.00       2.40 (A1-3)    2.70        6.00 (374)     8.70 (2264)
Swedish         2.56       0.55       1.52 (A1-3)    3.78        12.37 (905)    35.21 (3610)
Swiss German    0.51       0.27       1.06 (A1-3)    3.10        6.29 (684)     24.26 (3274)
Norwegian       2.31       0.53       4.43 (A1-6)    5.87        17.31 (1182)   34.73 (3438)
Slovenian       4.15       0.87       4.86 (A1-6)    6.14        9.33 (597)     19.25 (1491)
Slovak          0.54       0.00       1.72 (A1-3)    2.66        3.70 (500)     14.16 (2568)


Fig. 1. The progress of results for individual models (BC-test)



Abstract. The paper describes the development of a large vocabulary continuous
speech recogniser for the Slovenian language with the SNABI database. The problems
that inflectional languages pose for speech recognition are presented.
The system is based on hidden Markov models. For acoustic modeling biphones
were used, whereas for language modeling bigrams and trigrams were used. To
improve the recognition result and to enable fast operation of the recogniser,
speaker adaptation is also used. The optimal system with the adapted acoustic
model and bigram language model achieved a word accuracy of 91.30% at nearly 10x
real time. The unadapted system with the trigram language model achieved a
word accuracy of 89.56%, but it was also slower than the optimal system. Its run
time was 15.3x real time.



1 Introduction 

In this paper, we present the development of a large vocabulary continuous speech
recognition system for the Slovenian language based on hidden Markov models. To our
knowledge, there are no reports on large vocabulary continuous speech recognisers
for the Slovenian language. There are reports on recognition [1] with a medium vocabulary
(3000 words) on a narrow topic (weather reports), and on recognition with less complex
systems of the Slovenian SpeechDat(II) [2] telephone database within the COST 249 project
[3,4] and of the SNABI [5] and GOPOLIS [6] databases.

As our long term goal is to use such a system in a real time application, as a computer
dictation machine, the speed of recognition is also important. The main problem in the
development of a large vocabulary speech recognition system for the Slovenian language
is that Slovenian is an inflectional language, so the recognition task is in many respects
harder than for noninflectional languages. Inflection causes rapid growth of the vocabulary
when attempts are made to reduce the out of vocabulary (OOV) rate. The OOV rate is an
important factor in attempts to increase the word accuracy. In the development of the
system, the studio recorded version of the Slovenian database SNABI [7] was used.

The paper is organized as follows. The databases are presented in Section 2.
Characteristics of the developed recogniser are described in Section 3. The test results
are presented in Section 4, while the conclusion is made in Section 5.

2 Database

2.1 Audio Corpus 

The studio version of the SNABI database [7] was used for developing the acoustic models.
The database consists of speech material recorded by 56 speakers (35 male and 21 female)
in a studio environment. The total duration of the recordings is approximately 16 hours
of speech in 15869 sentences. The average sentence length in the different subcorpora is
between 7.5 and 9 words. The speech is downsampled to 16 kHz with 16-bit resolution.
Recordings of 4 speakers, 1 hour in length, were used for the test and were excluded from
the training material. The test material for all 4 speakers consists of 1034 sentences.
510 sentences of 1 speaker were separated from the training set and used for speaker
adaptation. The test set for the adapted speaker consists of 248 utterances. The database
also contains a subcorpus with phrases used in the area of office automation, which can be
applied as commands for the dictation task.

2.2 Text Corpus 

Text collected from the Slovenian national newspaper Vecer is used as the text corpus for
the development of the language model. The text corpus contains 14M words, of which 138k
are distinct. The corpus covers a wide range of topics. Because the size of the
recognition vocabulary is limited, out of vocabulary words are problematic, especially
for inflectional languages like Slovenian.

As we can see in Figure 1, even when the size of the recognition vocabulary is increased
to 100k, the coverage of the test set words is still only 87.6%. This means that the OOV
rate is 12.4%. This is due to the fact that in an inflectional language each word (lemma)
has many forms. The easiest way to handle this problem would be to take all the words from
the test set and close the recognition dictionary [8]. The influence of the OOV words
would thus be eliminated from the results. We have decided to use another approach, in
which the test set has no direct influence on the vocabulary. The text corpus is weighted
by adding copies of the orthographic transcription of the audio training corpus.
The number of copies was chosen empirically on the basis of the most frequent words in
the text corpus. In this way the topic of the audio training corpus is emphasized in the
text corpus, and the OOV rate of the test set for 4 speakers with 20k words in the
recognition dictionary is reduced from 27.4% to 6.4%. Without this influence of the
orthographic transcription of the audio training corpus, some rare inflectional forms were
not among the 20k most frequent words. The OOV rate of the test set for 1 speaker, which
is used for speaker adaptation, is reduced from 23.8% to 4.7% at the same size of the
recognition vocabulary. The initial OOV rate for 1 speaker is lower than the OOV rate for
all 4 speakers because the test sentences were not the same for each speaker. The success
of this method depends on the structure of the database; it was already found effective
in [9]. The OOV rate can also be reduced with the use of morphemes [10] or linguistic
rules [11].
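
The weighting scheme described above can be illustrated with a small sketch: the vocabulary is the set of the most frequent words in the text corpus, optionally after adding a number of copies of the training transcriptions, and the OOV rate is then measured on the test set, which never contributes to the vocabulary. The function names and the toy data in the usage note are hypothetical.

```python
from collections import Counter


def build_vocabulary(corpus_tokens, size, transcript_tokens=(), copies=0):
    """Most frequent `size` words after adding `copies` of the transcription."""
    counts = Counter(corpus_tokens)
    for word in transcript_tokens:
        counts[word] += copies
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_tokens, vocabulary):
    """Percentage of test tokens that are not in the vocabulary."""
    missing = sum(1 for word in test_tokens if word not in vocabulary)
    return 100.0 * missing / len(test_tokens)
```

For a toy corpus `"a a a b b c".split()` and a transcription `"d d".split()`, a two-word vocabulary contains "a" and "b" without weighting, but "d" and "a" once two copies of the transcription are added, so a test set containing "d" is covered only in the weighted case.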

3 Characteristics of the Recogniser 

Fig. 1. The dependency between the test set words and the number of words in the
recognition dictionary (axis label: number of words). The curve represents the coverage
of the test set words for the unweighted text corpus. The dots represent the coverage of
the systems used

The development procedure used for building the recogniser is based on previous
experience with a similar English recogniser [9]. The HTK toolkit [12] is used for the
development of acoustic models. The trigram recognition was performed with a modified
2-pass decoder. The speech material was recorded in a studio environment, thus
there is no need for noise robust features. The speech signal is processed with a 20 ms
long Hamming window with a shift of 10 ms. Each feature vector consists of 12 mel
cepstral coefficients and energy, together with their first and second derivatives.
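
The windowing described above can be sketched as follows, assuming the 16 kHz sampling rate given in Section 2.1, so that a 20 ms window is 320 samples and a 10 ms shift is 160 samples. The function names are hypothetical and the cepstral analysis itself is not shown.

```python
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(signal, sample_rate=16000, win_ms=20, shift_ms=10):
    """Cut the signal into overlapping frames and apply a Hamming window."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = hamming(win)
    out = []
    for start in range(0, len(signal) - win + 1, shift):
        chunk = signal[start:start + win]
        out.append([s * w for s, w in zip(chunk, window)])
    return out
```

One second of speech therefore produces 99 complete frames; with 12 cepstral coefficients plus energy and their first and second derivatives, each frame becomes a 39-component feature vector.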

Rare Slovenian phonemes from the original set were mapped to more frequent ones.
The simplified set consists of 31 different phonemes. Acoustic models are based on
3-state left-right Gaussian continuous density HMMs. To achieve robust detection of silent
passages, a special silence model with all possible transitions between states is used.
In the next stage of the development, a separate silence detection system should be
integrated in the recogniser. Because the speed of the recogniser is important,
cross-word biphones were designed. There are 872 different biphones. After phonetic
tree-based clustering was applied, there were 531 biphones with 922 states. This ensured
that the complexity of the system was an order of magnitude smaller than with triphones.
The experiment on the development set has shown that left-context biphones are superior
to right-context biphones. The number of Gaussian probability densities was increased to
4 per HMM state to cover the diversity of the acoustic space.

Large Vocabulary Continuous Speech Recognizer for Slovenian Language

The language models used in the recogniser were based on bigrams and trigrams
and were developed with the CMU-Cambridge toolkit [13]. The perplexity of the language
models was calculated on the orthographic transcription of the test set. It is 128.94 for
the bigram language model and 123.91 for the trigram language model. Due to the lower
complexity of the search space generated from the bigram language model, the recogniser
with such a model is faster. All recognisers were tested on an HP 9000/785 workstation.
Because the number of speakers in the database is too small, no gender dependent models
[14] are derived from the biphone HMMs. Instead of gender adaptation, speaker adaptation
is performed with the maximum likelihood linear regression (MLLR) procedure [15]. This can
increase the performance of the recogniser, especially when a narrow search beam is used
to achieve faster operation in recognition mode. There were 20k different words in the
recognition vocabulary.
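
The perplexity values quoted above follow the usual definition: the exponential of the negative average log-probability that the language model assigns to the test words. The sketch below assumes a bigram model given as a callable `bigram_prob(previous, word)`; this interface and the boundary tokens "BOS" and "EOS" are hypothetical and are not taken from the CMU-Cambridge toolkit.

```python
import math


def perplexity(sentences, bigram_prob, bos="BOS", eos="EOS"):
    """Perplexity of a bigram model on a list of tokenised sentences."""
    log_sum = 0.0
    count = 0
    for sentence in sentences:
        previous = bos
        for word in list(sentence) + [eos]:
            log_sum += math.log(bigram_prob(previous, word))
            count += 1
            previous = word
    return math.exp(-log_sum / count)
```

A model that assigns probability 1/128 to every word has perplexity 128, so a value such as 128.94 can be read as the model being about as uncertain as a uniform choice among roughly 129 words at each position.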



4 Performance of the System 

The first set of tests was performed with unadapted acoustic models for all 4 speakers
in the test set. Besides the word accuracy and sentence accuracy, the time needed to
complete the test was also measured. The real time factor (RT) was calculated from this
time and the length of the test utterances. The speed of the recogniser was increased by
decreasing the width of the search beam.
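
The two measures used in this section can be written down explicitly. The sketch assumes the common definitions: the real time factor is processing time divided by the duration of the processed speech, and word accuracy subtracts substitutions, deletions and insertions from the number of reference words (the definition used by HTK's scoring tool). The paper does not spell these formulas out, so they are assumptions, and the function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """How many times longer than the audio the recogniser runs."""
    return processing_seconds / audio_seconds


def word_accuracy(n_ref, substitutions, deletions, insertions):
    """Word accuracy in percent; insertions are penalised as well."""
    return 100.0 * (n_ref - substitutions - deletions - insertions) / n_ref
```

Under these definitions, a system running at 10x RT needs ten hours to process one hour of recorded speech.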



Table 1. Results of recognition for 4 speakers with unadapted HMMs and bigram and trigram
language models

System   Lang. Model   RT     Word Acc.(%)   Sent. Acc.(%)

RCG1     2g            59.5   89.96          66.92
RCG2     2g            23.7   87.98          65.18
RCG3a    2g            11.8   84.62          62.86
RCG3b    3g            15.3   89.56          75.44
RCG4     2g            4.1    52.45          44.89



As can be seen in Table 1, the baseline recogniser RCG1 with conservative pruning
settings achieved a word accuracy of 89.96%. Nevertheless, the time needed to complete
the test was long. When more aggressive pruning was used, the system operated much
faster, but the word accuracy decreased, as expected. RCG3a runs at 11.8x RT with a
5.34% absolute decrease in word accuracy in comparison to the baseline system. A further
decrease of the operating time below 10x RT is possible, as shown by RCG4, but the
recognition result is much worse. To achieve real time performance, modifications
of the recogniser will be necessary in the future. When the trigram language model was
used in the RCG3b system, the word accuracy increased to 89.56%, but the recognition time
also increased, to 15.3x RT. The use of the trigram language model also significantly
increases the sentence accuracy due to the improvement in word accuracy.

To soften the influence of narrow pruning on the decrease in accuracy, models
adapted with MLLR to one speaker were tested. The results are presented in Table 2.
The adaptation to the speaker improved the recognition performance of the recogniser
at all stages of pruning. The adapted baseline system RCG1 for 1 speaker achieved
an increase in word accuracy of 2.69%. There is also a significant improvement in
the sentence accuracy. The optimal adapted system RCG3a achieved a word accuracy of
91.30%, which is still better than the unadapted baseline system RCG1, which achieved
91.02%. The optimal system operated at 10.3x real time. The adapted trigram RCG3b
system achieved the best word and sentence accuracy, but its 14.7x real time factor was
worse than that of the optimal adapted bigram system RCG3a. Speaker adaptation did
improve the accuracy of the fastest system, RCG4, but the results are still not good
enough to be acceptable. The time needed to complete the test was slightly shorter for
the adapted system than for the same unadapted system, except for the RCG4 system. This
improvement in speed is due to the fact that, because of the increased word accuracy, the
best path was found faster.

Table 2. Results of recognition for 1 speaker with unadapted (UNA) and adapted (ADP) HMMs
with the bigram (2g) and the trigram (3g) language model

System   Models   RT     Word Acc.(%)   Sent. Acc.(%)

RCG1     UNA-2g   57.6   91.02          66.94
         ADP-2g   57.4   93.71          72.58
RCG2     UNA-2g   21.5   90.40          65.32
         ADP-2g   21.3   93.10          71.37
RCG3a    UNA-2g   10.6   88.16          63.71
         ADP-2g   10.3   91.30          69.76
RCG3b    UNA-3g   14.8   91.05          81.85
         ADP-3g   14.7   92.14          84.13
RCG4     UNA-2g   4.9    58.87          46.37
         ADP-2g   4.9    60.44          46.77

The dependency between the real time factor and the word accuracy for all systems
presented in the paper is shown in Figure 2. Our system achieved a word accuracy similar
to that of the comparable systems [8,16,17]. The Czech system in [8] achieved a word
accuracy of 81.6% and the German recogniser in [16] 92.5%. The whole system was weighted
to the topic of the SNABI database, but because cross-word biphones are used, the topic of
the system can be modified without retraining the acoustic models, which is the
most time consuming operation during the development phase.

5 Conclusion 

The development of a large vocabulary continuous speech recognition system for the
Slovenian language based on HMMs was presented. The studio recorded part of the SNABI
speech database was used. The language models were weighted to the topic of the acoustic
database. With this approach the out of vocabulary rate for the test set was reduced.
The OOV rate is an important problem in speech recognition of inflectional languages
like Slovenian. The system achieved comparable word accuracy. The use of speaker
adaptation limits the loss of recognition performance at higher pruning, which is
necessary for faster operation. The optimal system developed is a speaker adapted
biphone system with a bigram language model. The trigram recogniser achieved better
accuracy, but was slower than the optimal bigram system. To achieve real time operation
in the future, which is needed for practical use, the recognition module must
be further improved.

Fig. 2. The word accuracy and real time performance dependency for all the configurations
of the recogniser (axis label: real time (RT); series: 4SPK, 1SPK-UNA, 1SPK-ADP)



Abstract. This paper studies the application of automatic phoneme classification
to the computer-aided training of the speech and hearing handicapped. In particular,
we focus on how efficiently discriminant analysis can reduce the number
of features and increase classification performance. A nonlinear counterpart of
Linear Discriminant Analysis, which is a general purpose class specific feature
extractor, is presented where the nonlinearization is carried out by employing the
so-called 'kernel-idea'. Then, we examine how this nonlinear extraction technique
affects the efficiency of learning algorithms such as Artificial Neural Network and
Support Vector Machines.



1 Speech Impediment Therapy and Real-Time Phoneme Classification

This paper deals with the application of speech recognition to the computer-aided training 
of the speech and hearing handicapped. The program we present was designed to help in 
the speech training of the hearing impaired, where the goal is to support or replace their 
diminished auditory feedback with a visual one. But the program could also be applied 
to improving the reading skills of children with reading difficulties. Experience shows 
that computers more readily attract the attention of young people, who are usually more 
willing to practice with the computer than with the traditional drills. 

Since both groups of our intended users consist mostly of young children it was 
most important that the design of the software interface be made attractive and novel. In 
addition, we realized early on that the real-time visual feedback the software provides 
must be kept simple, otherwise the human eye cannot follow it. Basically this is why 
the output of a speech recognizer seems better suited to this goal than the usual method 
where only the short-time spectrum is displayed: a few flickering discrete symbols are 
much easier to follow than a spectrum curve, which requires further mental processing. 
This is especially the case with very young children. 

From the speech recognition point of view the need for a real-time output poses
a number of special problems. Owing to the need for very fast classification we cannot
delay the response even until the end of phonemes, hence we cannot employ complicated
long-term models. The algorithm should process no more than a few neighbouring
frames. Furthermore, since the program has to recognize vowels pronounced in isolation
as well, a language model cannot be applied.

In our initial experiments we focussed on the classification of vowels, as the learning
of the vowels is the most challenging for the hearing-impaired. The software supposes
that the vowels are pronounced in isolation or in the form of two-syllable words, which
is a more usual training strategy. The program provides visual feedback on a
frame-by-frame basis in the form of flickering letters, their brightness being
proportional to the speech recognizer’s output (see Figure 1). To see the speaker’s
progress over longer periods, the program can also display the recognition scores during
the previous utterance (see Figure 2). Of course it is always possible to examine the
sample spectra as well, either on a frame-by-frame or on an utterance-based basis. The
utterances can be recorded and played back for further study and analysis by the teacher.




Fig. 1. A screenshot from EasySpeech. The real-time response of the system for vowel /a/. 



This article describes the experiments conducted with the LDA and Kernel-LDA 
transforms, intended to improve and possibly speed up the classification of vowels. As for 
the classification itself we used neural nets (ANN) and support vector machines (SVM). 
The section below explains the mathematical details of the Kernel-LDA transform, which 
is a new non-linear extension of the traditional LDA technique*. 

2 Linear Discriminant Analysis with and without Kernels 

Before executing a learning algorithm it is a common practice to preprocess the data
by extracting new features. Of the class specific feature extractors, Linear Discriminant
Analysis (LDA) is a traditional statistical method which has proved to be one of the
most successful preprocessing techniques in classification^. The role of this method as
preprocessing is twofold. Firstly it can improve classification performance, and secondly
it may also reduce the dimensionality of the data and hence significantly speed up the
classification.

* In [4] this method bears the name “Kernel Fisher Discriminant Analysis”. Independently of
these authors we arrived at the same formulae too, the only difference being that we
derived the formulae for the multiclass case, naming the technique “Kernel-LDA”. Although
we recently reported our results of Kernel-LDA on word recognition in [6], the method
itself was not described in great detail.

Fig. 2. A screenshot of EasySpeech after pronouncing the word /mimi/.

The goal of Linear Discriminant Analysis is to find a new (not necessarily orthogonal)
basis for the data which provides the optimal separation between groups of points
(classes). Without loss of generality we will assume that the original data set, i.e. the
input data, lies in $\mathbb{R}^n$ and is denoted by $x_1, \ldots, x_r$. The class label
of each data vector is supposed to be known beforehand. Let us assume that we have $k$
classes and an indicator function $f() : \{1, \ldots, r\} \to \{1, \ldots, k\}$, where
$f(i)$ gives the class label of the point $x_i$. Let $r_j$ ($j \in \{1, \ldots, k\}$,
$r = r_1 + \ldots + r_k$) denote the number of vectors associated with label $j$ in the
data. In this section we now review the formulae for LDA, and also a nonlinear extension
using the so-called 'kernel-idea'.



2.1 Linear Discriminant Analysis 

In order to extract $m$ informative features from the $n$-dimensional input data, we first
define a function $\tau() : \mathbb{R}^n \to \mathbb{R}$ which serves as a measure for
selecting the $m$ directions (i.e. base vectors of the new basis) one at a time. For a
selected direction $a$ a new real valued feature can be calculated as $a^\top x$.
Intuitively, if larger values of $\tau()$ indicate better directions and the chosen
directions need to be somehow independent, choosing stationary points that have large
values is a reasonable strategy. So we define a new basis for the input data based on $m$
stationary points of $\tau$ with dominant function values.

^ One should note here that it can be directly used for classification as well.

Now let us define

$$\tau(a) = \frac{a^\top B a}{a^\top W a}, \qquad a \in \mathbb{R}^n \setminus \{0\}, \qquad (1)$$

where $B$ is the Between-class Scatter Matrix, while $W$ is the Within-class Scatter Matrix.
Here the Between-class Scatter Matrix $B$ represents the scatter of the class mean vectors
$\mu_j$ around the overall mean vector $\mu$, while the Within-class Scatter Matrix $W$
shows the weighted average scatter of the covariance matrices $C_j$ of the sample vectors
having label $j$:

$$B = \sum_{j=1}^{k} \frac{r_j}{r} (\mu_j - \mu)(\mu_j - \mu)^\top, \qquad W = \sum_{j=1}^{k} \frac{r_j}{r} C_j,$$

$$C_j = \frac{1}{r_j} \sum_{f(i)=j} (x_i - \mu_j)(x_i - \mu_j)^\top, \qquad \mu_j = \frac{1}{r_j} \sum_{f(i)=j} x_i. \qquad (2)$$

Since $\tau(a)$ is large when its numerator is large and its denominator is small, the
within-class averages of the sample projected onto $a$ are far from each other, while the
variance of the classes is small. The larger the value of $\tau(a)$, the farther the
classes will be spaced and the smaller their spreads will be. It can be easily shown that
stationary points of (1) correspond to the right eigenvectors of $W^{-1}B$, where the
eigenvalues form the corresponding function values. Since $W^{-1}B$ is not necessarily
symmetrical, the number of real eigenvalues can be less than $n$ and the corresponding
eigenvectors will not necessarily be orthogonal^. If we select the $m$ eigenvectors with
the greatest real eigenvalues (denoted by $a_1, \ldots, a_m$), we will obtain new features
from an arbitrary data vector $y \in \mathbb{R}^n$ by $a_1^\top y, \ldots, a_m^\top y$.
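
The statement that the stationary points of (1) are eigenvectors can be checked with a short derivation. The sketch below writes $\tau$ for the function defined in (1) and assumes that $B$ and $W$ are symmetric (as scatter matrices are) and that $W$ is invertible; it is a standard argument added for illustration, not a derivation quoted from the paper.

```latex
\nabla \tau(a)
  = \frac{2\,B a\,(a^{\top} W a) - 2\,W a\,(a^{\top} B a)}{(a^{\top} W a)^{2}}
  = \frac{2}{a^{\top} W a}\,\bigl(B a - \tau(a)\, W a\bigr) = 0
\;\Longleftrightarrow\;
B a = \tau(a)\, W a
\;\Longleftrightarrow\;
W^{-1} B\, a = \tau(a)\, a .
```

A stationary point $a$ is therefore an eigenvector of $W^{-1}B$, and the value $\tau(a)$ attained there is the corresponding eigenvalue.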

2.2 Kernel-LDA 

Here the symbol $\mathcal{H}$ denotes a real vector space that could be finite or infinite in
dimension, and we suppose a mapping $\Phi : \mathbb{R}^n \to \mathcal{H}$, which is not
necessarily linear. In addition, let us assume that the algorithm of Linear Discriminant
Analysis is denoted by $\mathcal{P}$ and its input is the points $x_1, \ldots, x_r$ of the
vector space $\mathbb{R}^n$. The output of the algorithm is a linear transformation
$\mathbb{R}^n \to \mathbb{R}^m$, where both the degree of the dimension reduction
(represented by m) and the n x m transformation matrix are determined by the algorithm
itself. $\mathcal{P}(x_1, \ldots, x_r)$ will denote the transformation matrix which results
from the input data. Then the algorithm $\mathcal{P}$ is replaced by an equivalent algorithm
$\mathcal{P}'$ for which
$\mathcal{P}(x_1, \ldots, x_r) = \mathcal{P}'(x_1^\top x_1, \ldots, x_i^\top x_j, \ldots, x_r^\top x_r)$
holds for arbitrary $x_1, \ldots, x_r$. Thus $\mathcal{P}'$ is equivalent to $\mathcal{P}$ but
its inputs are the pairwise dot products of the inputs of algorithm $\mathcal{P}$. Applying a
nonlinear mapping $\Phi$ to the input data then yields a nonlinear feature transformation
matrix $\mathcal{P}'(\Phi(x_1)^\top\Phi(x_1), \ldots, \Phi(x_i)^\top\Phi(x_j), \ldots, \Phi(x_r)^\top\Phi(x_r))$.
These dot products can be computed here in $\mathcal{H}$ (which may be infinite in dimension),
but if we have a low-complexity (perhaps linear) kernel function
$\kappa : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ for which
$\Phi(x)^\top\Phi(y) = \kappa(x, y)$, $x, y \in \mathbb{R}^n$, then
$\Phi(x_i)^\top\Phi(x_j)$ can also be computed

^3 Besides this, numerical problems can occur during the computation of $W^{-1}$ if det(W) is
near zero. The most probable cause for this could be the redundancy of feature components. But
we know W is positive semidefinite. So if we add a small positive constant $\varepsilon$ to its
diagonal, that is, we work with $W + \varepsilon I$ instead of W, this matrix is guaranteed to
be positive definite and hence should always be invertible. This small act of cheating can have
only a negligible effect on the stationary points of (1).
with fewer operations (for example O(n)) even when the dimensions of $\Phi(x_i)$ and
$\Phi(x_j)$ are infinite. So, after choosing a kernel function, the only thing that remains
is to take the algorithm $\mathcal{P}'$ and replace the input elements
$x_1^\top x_1, \ldots, x_i^\top x_j, \ldots, x_r^\top x_r$ with the elements
$\kappa(x_1, x_1), \ldots, \kappa(x_i, x_j), \ldots, \kappa(x_r, x_r)$. The algorithm that
arises from this substitution can perform the transformations with a practically acceptable
complexity, whatever the spatial dimension. This transformation (together with a properly
chosen kernel function) results in a non-linear feature extraction. The key idea here is
that we do not need to know the mapping $\Phi$ explicitly; we need only a kernel function
$\kappa : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ for which there exists a mapping
$\Phi$ such that $\Phi(x)^\top\Phi(y) = \kappa(x, y)$, $x, y \in \mathbb{R}^n$. There are many
good publications about the proper choice of the kernel functions, and also about their
theory in general [7]. The two most popular kernels are the following
($p \in \mathbb{N}^+$ and $\sigma \in \mathbb{R}^+$):

$$\kappa_1(x, y) = (x^\top y + 1)^p, \qquad \kappa_2(x, y) = \exp(-\|x - y\|^2 / \sigma). \qquad (3)$$
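
To make the two kernels of eq. (3) concrete, the following short sketch implements them and checks, for the polynomial kernel with p = 2 in two dimensions, that the kernel value coincides with a dot product in an explicitly constructed feature space. The function names and the explicit map `phi2` are our own illustrative choices and do not appear in the paper.

```python
import numpy as np

def poly_kernel(x, y, p=3):
    """Polynomial kernel (x^T y + 1)^p."""
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis kernel exp(-||x - y||^2 / sigma)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / sigma)

def phi2(x):
    """One explicit feature map whose dot product equals poly_kernel with p = 2 (2-D input)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
k_implicit = poly_kernel(x, y, p=2)        # computed in the input space
k_explicit = np.dot(phi2(x), phi2(y))      # computed in the 6-dimensional feature space
```

Both values agree, but the kernel needs only one dot product in the original space. For the radial basis kernel no finite-dimensional map of this kind exists, which is where the implicit formulation becomes essential.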



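Practically speaking, the original LDA algorithm is executed in a transformed (possibly
infinite-dimensional) feature space $\mathcal{H}$, where the kernel function $\kappa$ gives
implicit access to the elements of this space. In the following we present the kernel analogue
of LDA by transforming the algorithm $\mathcal{P}$ to $\mathcal{P}'$. Let us consider the
following function for a fixed $\kappa$, $\Phi$ and $\mathcal{H}$:

$$r^{\Phi}(a) = \frac{a^\top B^{\Phi} a}{a^\top W^{\Phi} a}, \qquad a \in \mathcal{H} \setminus \{0\}, \qquad (4)$$

where the matrices needed for LDA are now given in $\mathcal{H}$:

$$B^{\Phi} = \sum_{j=1}^{k} \frac{r_j}{r} (\mu_j^{\Phi} - \mu^{\Phi})(\mu_j^{\Phi} - \mu^{\Phi})^\top, \qquad W^{\Phi} = \sum_{j=1}^{k} \frac{r_j}{r} C_j^{\Phi},$$

$$C_j^{\Phi} = \frac{1}{r_j} \sum_{\mathcal{L}(i)=j} (\Phi(x_i) - \mu_j^{\Phi})(\Phi(x_i) - \mu_j^{\Phi})^\top,$$

$$\mu^{\Phi} = \frac{1}{r} \sum_{i=1}^{r} \Phi(x_i), \qquad \mu_j^{\Phi} = \frac{1}{r_j} \sum_{\mathcal{L}(i)=j} \Phi(x_i). \qquad (5)$$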



We may suppose without loss of generality that $a = \sum_{i=1}^{r} \alpha_i \Phi(x_i)$ holds
during the search for the stationary points of (4) (see footnote 4).

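Now

$$a^\top B^{\Phi} a = \Big(\sum_{t=1}^{r} \alpha_t \Phi(x_t)^\top\Big) \Big[\sum_{j=1}^{k} \frac{r_j}{r} \Big(\frac{1}{r_j} \sum_{\mathcal{L}(i)=j} \Phi(x_i) - \frac{1}{r} \sum_{i=1}^{r} \Phi(x_i)\Big) \Big(\frac{1}{r_j} \sum_{\mathcal{L}(i)=j} \Phi(x_i)^\top - \frac{1}{r} \sum_{i=1}^{r} \Phi(x_i)^\top\Big)\Big] \Big(\sum_{s=1}^{r} \alpha_s \Phi(x_s)\Big). \qquad (6)$$

Hence $a^\top B^{\Phi} a$ can also be expressed as $\alpha^\top K^{B} \alpha$, where
$\alpha = [\alpha_1, \ldots, \alpha_r]^\top$ and where the matrix $K^{B}$ is of size r x r,
and for its elements with index (t, s) the following holds:

$$K^{B}_{ts} = \sum_{j=1}^{k} \frac{r_j}{r} \Big(\frac{1}{r_j} \sum_{\mathcal{L}(i)=j} \kappa(x_t, x_i) - \frac{1}{r} \sum_{i=1}^{r} \kappa(x_t, x_i)\Big) \Big(\frac{1}{r_j} \sum_{\mathcal{L}(i)=j} \kappa(x_i, x_s) - \frac{1}{r} \sum_{i=1}^{r} \kappa(x_i, x_s)\Big). \qquad (7)$$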



^4 This assumption can be arrived at in several ways. For instance, we can decompose
an arbitrary vector a into $a_1 + a_2$, where $a_1$ is the component of a which falls in
$\mathrm{SPAN}(\Phi(x_1), \ldots, \Phi(x_r))$, while $a_2$ gives the component perpendicular
to it. Then from the derivative of (4) it can be proved that $a_2^\top a_2 = 0$ for
stationary points.
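Then

$$a^\top W^{\Phi} a = \Big(\sum_{t=1}^{r} \alpha_t \Phi(x_t)^\top\Big) \Big[\sum_{j=1}^{k} \frac{1}{r} \sum_{\mathcal{L}(i)=j} \Big(\Phi(x_i) - \frac{1}{r_j} \sum_{\mathcal{L}(l)=j} \Phi(x_l)\Big) \Big(\Phi(x_i)^\top - \frac{1}{r_j} \sum_{\mathcal{L}(l)=j} \Phi(x_l)^\top\Big)\Big] \Big(\sum_{s=1}^{r} \alpha_s \Phi(x_s)\Big). \qquad (8)$$

We can now express $a^\top W^{\Phi} a$ in the form $\alpha^\top K^{W} \alpha$, where the
matrix $K^{W}$ is of size r x r and

$$K^{W}_{ts} = \sum_{j=1}^{k} \frac{1}{r} \sum_{\mathcal{L}(i)=j} \Big(\kappa(x_t, x_i) - \frac{1}{r_j} \sum_{\mathcal{L}(l)=j} \kappa(x_t, x_l)\Big) \Big(\kappa(x_i, x_s) - \frac{1}{r_j} \sum_{\mathcal{L}(l)=j} \kappa(x_l, x_s)\Big). \qquad (9)$$

Combining the above equations we obtain the equality

$$\frac{a^\top B^{\Phi} a}{a^\top W^{\Phi} a} = \frac{\alpha^\top K^{B} \alpha}{\alpha^\top K^{W} \alpha}. \qquad (10)$$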



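This means that (4) can be expressed in terms of dot products of
$\Phi(x_1), \ldots, \Phi(x_r)$ and that the stationary points of this equation can be
computed using the real eigenvectors (see footnote 5) of $(K^{W})^{-1} K^{B}$. We will use
only those eigenvectors which correspond to the m dominant real eigenvalues, denoted by
$\alpha^1, \ldots, \alpha^m$. Consequently, the transformation matrix $A_{\Phi}$ of
Kernel-LDA is

$$A_{\Phi} = \Big[\frac{1}{\theta_1} \sum_{i=1}^{r} \alpha_i^1 \Phi(x_i), \ldots, \frac{1}{\theta_m} \sum_{i=1}^{r} \alpha_i^m \Phi(x_i)\Big], \qquad \theta_k = \Big(\sum_{i=1}^{r} \sum_{j=1}^{r} \alpha_i^k \alpha_j^k \kappa(x_i, x_j)\Big)^{1/2}, \qquad (11)$$

where the value of the normalization parameter $\theta_k$ is chosen such that the norm of the
column vectors remains unity. For an arbitrary data vector y, new features can be computed
via $\big[\frac{1}{\theta_1} \sum_{i=1}^{r} \alpha_i^1 \kappa(x_i, y), \ldots, \frac{1}{\theta_m} \sum_{i=1}^{r} \alpha_i^m \kappa(x_i, y)\big]$.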
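
The derivation above translates quite directly into matrix operations on the r x r kernel (Gram) matrix. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: the function names, the `eps` regularisation of $K^{W}$ (in the spirit of footnotes 3 and 5), the eigenvalue filtering threshold and the toy data are all hypothetical.

```python
import numpy as np

def kernel_lda_fit(X, labels, kernel, m, eps=1e-6):
    """Return an (r, m) matrix whose k-th column is alpha^k / theta_k."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    r = X.shape[0]
    # Gram matrix K[t, i] = kernel(x_t, x_i)
    K = np.array([[kernel(xt, xi) for xi in X] for xt in X])
    overall = K.mean(axis=1)                   # (1/r) sum_i kernel(x_t, x_i)
    KB = np.zeros((r, r))
    KW = np.zeros((r, r))
    for j in np.unique(labels):
        idx = np.where(labels == j)[0]
        rj = len(idx)
        class_mean = K[:, idx].mean(axis=1)    # (1/r_j) sum_{L(i)=j} kernel(x_t, x_i)
        d = (class_mean - overall)[:, None]
        KB += (rj / r) * (d @ d.T)             # eq. (7)
        D = K[:, idx] - class_mean[:, None]    # kernel(x_t, x_i) minus class mean
        KW += (D @ D.T) / r                    # eq. (9)
    KW += eps * np.eye(r)                      # force invertibility
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(KW, KB))
    keep = np.abs(eigvals.imag) < 1e-9         # real eigenvalues only
    eigvals, eigvecs = eigvals.real[keep], eigvecs.real[:, keep]
    order = np.argsort(eigvals)[::-1][:m]      # m dominant eigenvalues
    alphas = eigvecs[:, order]
    # theta_k = sqrt(alpha^k^T K alpha^k), the normalisation of eq. (11)
    theta = np.sqrt(np.einsum("ik,ij,jk->k", alphas, K, alphas))
    return alphas / theta

def kernel_lda_transform(alphas, X_train, kernel, y):
    """New features of y: sum_i alpha_i^k kernel(x_i, y) / theta_k for each k."""
    k_y = np.array([kernel(xi, y) for xi in X_train])
    return alphas.T @ k_y

# Toy 1-D data that no linear projection can separate: inner vs. outer points.
X = np.array([[-0.5], [-0.25], [0.25], [0.5], [-2.25], [-2.0], [2.0], [2.25]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
quad = lambda u, v: (np.dot(u, v) + 1.0) ** 2  # polynomial kernel with p = 2
A = kernel_lda_fit(X, labels, quad, m=1)
z = np.array([kernel_lda_transform(A, X, quad, x)[0] for x in X])
```

In this toy case the single Kernel-LDA feature places the inner and the outer points in two disjoint intervals. Note that building the Gram matrix costs O(r^2) kernel evaluations, which is one reason why the kernel version is slower than plain LDA.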



3 Experimental Results 

Corpus. For training and testing purposes we recorded samples from 25 speakers, mostly 
children aged between 8 and 15, but the database used contained some adults too. The 
speech signals were recorded and stored at a sampling rate of 22050 Hz in 16-bit quality. 
Each speaker uttered 59 two-syllable Hungarian words of the CVCVC form, where the 
consonants (C) were mostly unvoiced plosives to ease the detection of the vowels (V). 
The distribution of the vowels was approximately uniform in the database. Because we 
decided not to discriminate their long and short versions, we worked with 9 vowels 
altogether. In the experiments 20 speakers were used for training and 5 for testing.

^5 Since in general $K^{W}$ is a positive semidefinite matrix with its determinant sometimes
near zero, it can be forced to be invertible using the technique presented in the subsection
on LDA. Please see footnote 3 as well.
Feature Sets. The signals were processed in 10 ms frames, the log-energies of
24 critical-bands being extracted using FFT and triangular weighting [5]. The energy of 
each frame was normalized separately, which means that only the spectral shape was 
used for classification. Our previous results showed that an additional cosine transform 
(which would lead to the most commonly used MFCC coefficients) does not affect the 
performance of the classifiers we had intended to apply, so it was omitted. Brief tests 
showed that neither varying the frame size nor increasing the number of filters gave any 
significant increase in classifier performance. 

In our most basic tests we used only the filter-bank log-energies from the middle
frame of the steady-state part of each vowel ("FBLE" set). Then we added the derivatives
of these features to model the signal dynamics ("FBLE+Deriv" set). In another experiment
we smoothed the feature trajectories to remove the effect of transient noises and
disturbances ("FBLE Smooth" set). In yet another set of features we extended the
log-energies with the gravity centers of four frequency bands, approximately corresponding
to the possible values of the formants. These gravity centers allegedly give a crude
approximation of the formants ("FBLE+Grav" set) [1]. Lastly, for the sake of curiosity
we performed a test with the feature set of our segmental model ("Segmental" set) [6].
Since this describes a whole phonemic segment rather than just one frame, it clearly could
not be applied in a real-time system. So our aim then was simply to see the advantages
of a segmental classifier over a frame-based one.

Classifiers. In all the experiments with Artificial Neural Nets (ANN) [2] the well- 
known three-layer feed-forward MLP networks were employed with the backpropaga- 
tion learning rule. The number of hidden neurons was equal to the number of features. 

In the Support Vector Machine (SVM) [7] experiments we always made use of the
radial basis kernel function $\kappa_2$ (see eq. (3)).

Transformations. In our tests with LDA and Kernel-LDA the eigenvectors belonging
to the 16 dominant eigenvalues were chosen as basis vectors for the transformed space,
and for Kernel-LDA the third-order polynomial kernel $\kappa_1$ with p = 3 was used (see
eq. (3)).

4 Results and Discussion 

Table 1 lists the recognition errors where the rows represent the five feature sets while 
the columns correspond to the applied transformation and classifier combinations. 

On examining the results on the different feature sets we saw that adding the derivative 
did not increase performance. On the other hand smoothing the trajectories proved 
beneficial. Most likely a good combination of smoothing and derivation (or even better, 
RASTA filtering) would give better results. 

As regards the gravity center features, they brought about an improvement, but only a
slight one. This result accords with our previous experiments [3]. Lastly, the full
segmental model clearly performed better than all the frame-based classifiers. This
demonstrates the advantage of modeling full phonetic segments over frame-based classification.

When examining the effects of LDA and Kernel-LDA, it can be seen that a nonlinear
transformation normally performs better in separating the classes than its linear
counterpart owing to its larger degree of freedom. One other interesting observation is
that although the transformations retained only 16 features, the classifiers attain the same
or better scores. Since the computation of LDA is fast, the reduction in the number of
features speeds up not only the training but also the recognition phase. As yet, this does
not hold for the Kernel-LDA algorithm we currently use, but we are working on a faster
implementation.

Finally, as regards the classifiers, SVM consistently outperformed ANN by a few
percentage points. This can mostly be attributed to the fact that the SVM algorithm copes
with overfitting, which is a common problem in ANN training.



Table 1. Recognition errors for the vowel classification task. The numbers in parentheses
correspond to the number of features.

Transformation      none      none      LDA       LDA       K-LDA     K-LDA
Classifier          ANN       SVM       ANN (16)  SVM (16)  ANN (16)  SVM (16)

FBLE (24)           26.71 %   22.70 %   25.82 %   24.01 %   24.52 %   21.05 %
FBLE+Deriv (48)     25.82 %   24.01 %   27.30 %   24.34 %   24.34 %   21.21 %
FBLE+Grav (32)      24.01 %   22.03 %   24.67 %   23.85 %   22.87 %   20.72 %
FBLE Smooth (24)    23.68 %   21.05 %   23.03 %   21.87 %   22.70 %   19.90 %
Segmental (77)      19.57 %   19.08 %   20.04 %   18.42 %   18.09 %   17.26 %



5 Conclusion 

Our results show that transforming the training data before learning can definitely
increase classifier performance, and also speed up classification. We also saw that a
nonlinearized transformation is more effective than the traditional linear version, although
it is currently much slower. At present we are working on a sparse data representation
scheme that we hope will give an order of magnitude increase in calculation speed.
As regards the classifiers, SVM always performs slightly better than ANN, so we plan
to employ it in the future. From the application point of view, our biggest problem at the
moment is the feature set. We are looking for more phonetically-based features so as to
decrease the classification error, since reliable performance is very important in speech
impediment therapy.
Abstract. In this paper, ongoing work on the development of the speech recognition
modules of an MMIR environment for Dutch is described. The work on the
generation of acoustic models and language models is presented along with their current
performance. Some characteristics of the Dutch language and of the
target video archives that require special treatment are discussed.



1 Introduction 

Using speech recognition transcripts of spoken audio to enhance the accessibility of 
multi-media streams via text indexing and/or retrieval techniques has proven to be very 
useful. Automatic speech recognition is applied in various Multimedia Information Re- 
trieval (MMIR) systems (like for example [1]). Transcribing speech offers the opportu- 
nity not only to make audio content accessible via standard full-text retrieval and other 
advanced retrieval tools, but via the time-code of the audio part, also video fragments 
can be indexed on the basis of content features. In several ways the topic is part of 
the international research agenda: there has been a TREC-task for Spoken Document 
Retrieval (SDR) [9], there will be a video retrieval task at this year’s TREC, and also in 
the Topic Tracking and Detection evaluation event [16] speech transcripts are a primary 
source. 

There is a large number of issues still to be solved in this domain, of which the following
two inspired the work described here. For an overview of other research themes, cf. [11].

1. One of the biggest challenges in building such a system is undoubtedly the devel- 
opment and implementation of a large vocabulary, speaker independent, continuous 
speech recogniser (LVCSR) for the specific language. This functionality is crucial 
for the support of disclosing e.g. video archives with programs on a broad domain, 
with a non-fixed set of speakers, such as news shows. 

2. Most existing MMIR systems or prototypes that use speech recognition focus
on the English language, and many of the IR paradigms can readily be applied to
speech data from almost any language. However, for non-English speech, tailored
recognition techniques are often needed in order to make the generated transcripts
accessible or otherwise useful within a retrieval environment.

This paper describes the development of an MMIR environment for Dutch video 
archives, taken up in a series of related collaborative projects in which among others 
both the University of Twente and the research organisation TNO participate. The focus 
will be on the applied approach to speech recognition and on the requirements following
from the envisaged applications for the projects DRUID (Document Retrieval Using 
Intelligent Disclosure) [12] and ECHO (European CHronicles Online). Whereas the 
DRUID project concentrates on contemporary data from Dutch broadcasts, the ECHO 
project aims mainly at the disclosure of historical national video archives. At the outset 
of DRUID in 1998 some experience with SDR was available from the OLIVE project in 
which the speech recognition for English, French and German was provided by LIMSI
[12,13]. However, no speech recognition system for Dutch was available suitable for SDR 
research. We were given the opportunity to use the ABBOT speech recognition system 
[14] originally developed for English at the Universities of Cambridge and Sheffield. 
One of the major goals within DRUID is to port the ABBOT system to Dutch by de- 
veloping both language specific speech models and language models for Dutch. This 
implied firstly, that sufficient amounts of speech and text data along with an extensive 
pronunciation lexicon had to be collected. Secondly, language specific characteristics 
had to be considered thoroughly for determining system parameters, like the phone set 
to be used and the main vocabulary features. Finally, the system had to be tailored to the
envisaged video retrieval task. Within ECHO the focus is on the additional requirements 
following from the historical nature of the target archives. 

In the next sections we will give an overview of the work done on acoustic and
language modelling for a Dutch speech recognition system for SDR, along with some
preliminary evaluation statistics. We will specifically go into some language
characteristics of Dutch that are important in the language modelling process.



2 Acoustic Modelling 

The speech recognition system ABBOT is a hybrid connectionist/HMM system [7]. The 
acoustic modelling is done with a recurrent neural net which estimates the posterior 
probability of each phone given the acoustic data. An indication of the performance 
of the acoustic modelling is obtained by unlinking the neural net from the system and 
looking at its phone classification performance that is typically expressed in phone error 
rate. 

2.1 Training Data and Performance 

The training data consists of 50 hours of acoustic training data with textual transcripts and
a phonetic dictionary. The former is a combination of some 35 hours of mainly read 
speech that was publicly available (Groningen Corpus and Speech-Styles Corpus) and 
an additional corpus of about 15 hours of read speech from newspapers, which was 
added in view of the target data for DRUID: news programs. On test data consisting of 
read speech only, the baseline performance was a 32% phone error rate. On broadcast 
news test data we achieved a phone error rate of 55% which is not surprising given 
the discrepancy between training and test data: broadcast news material contains both 
read and spontaneous speech, in studio but also in noisy environments, with broad- 
band as well as narrow-band recordings. Clearly, to achieve better results on broadcast 
news data, the models have to be adapted to the acoustic conditions and speech types 
in this domain. For this purpose, the manual transcription of a substantial collection of
broadcast news data was started last year at TNO. Recently, a first release of
the Dutch national speech resource collection project 'Corpus Gesproken Nederlands'*
(CGN) also became available. This corpus contains a variety of speech types in different
contexts, and a reasonable amount of its data is used for the improvement of the acoustic
models for the broadcast news domain.

2.2 Acoustic Modelling Issues 

Historical Audio Data. Within the ECHO project, the speech recogniser is called in for 
the transcription of audio data, derived from historical video archives. Quality of this 
audio ranges from very low (before 1950) to medium (until 1965) and reasonable (until 
1975). Therefore, we expected our speech recognition performance to be considerably 
lower for this kind of data compared to contemporary data. Indeed, a preliminary test run 
showed a Word Error Rate of 68%. Since the collection became available only recently, 
we have not been able to do a thorough study, but a first glance at the data shows that 
we could improve acoustic modelling by adapting our models to one particular speaker 
(Philip Bloemendaal). This speaker is famous in the Netherlands because his voice is 
a characteristic element in a collection covering three decades of news items (called
’Polygoon Journaals’, shown in cinemas) and this voice is present in a substantial part 
of the ECHO collection. 

Phonetic Dictionary. An indispensable tool in the acoustic model training process is a
reliable phonetic dictionary to convert the words in the audio transcripts to their phonetic
representations. The Dutch dictionary publisher Van Dale Lexicography provided us with
a phonetic dictionary of 223K words. These very detailed (and manually checked) phonetic
transcriptions, which were generated using a phone set of about 200 different phones,
served as a starting point for all our lexicon development steps. First, a
grapheme-to-phoneme (G2P) converter was trained using a machine learning algorithm [6].
Given a word list of unseen words, without names or acronyms, the G2P converter transcribed
72% of the words to the correct transcriptions in the Van Dale phone set. Of the 28% of the
words that were wrongly transcribed, roughly 10% consisted of foreign words and words with
spelling errors. Since the Van Dale phone set is much too detailed and therefore not
suitable for speech recognition purposes, conversion tables were built to map the Van
Dale phone set to the DRUID phone set, which is SAMPA with a few modifications.
Furthermore, the phonetic dictionary was augmented by adding derivations of existing
words and compounds for which we could predict the phonetic transcription with maximum
certainty. Finally, words from transcription tasks and language model vocabularies that
had been processed by the G2P routine and manually checked were added.
The Van Dale format dictionary now contains 230K transcriptions. The automatically
generated compound dictionary in DRUID format contains an additional 213K entries.



* http://www.elis.mg.ac.be/cgn/ 
3 Language Modelling

For the language modelling of the speech recogniser we collected 152M words from var- 
ious sources. Dutch newspaper data (146M words) was provided by the ’Persdatabank’, 
an organisation that administers the exploitation rights of four major Dutch newspapers. 
In [2] LM perplexity on broadcast news test data is reduced considerably by adding tran- 
scripts of broadcast news shows (BNA & BNC corpus) to the LM training data. Since 
similar corpora are not available for Dutch, we started recording teletext subtitles from 
broadcast news and 'current affairs' shows in 1998. On top of that, the Dutch National
Broadcast Foundation (NOS) provides the auto cues of broadcast news shows. Although
the teletext material, and to a lesser degree the auto cues material, do not match the
actual speech as well as manual transcripts do, they are a welcome addition to our data set.

All data was first converted to XML and stored in a database to allow content selection 
(foreign affairs, politics, business, sports, etc.). A pre-processing module was built on
top of the database to enable the conversion of the raw newspaper text to a version more
suitable for language modelling purposes. Basically, the module reduces the number
of spelling variants. It removes punctuation (or writes certain punctuation to a special
symbol), expands numbers and abbreviations, and does case processing based on the 
uppercase/lowercase statistics of the complete corpus. Finally, the module tries to correct 
frequent spelling errors based on a spelling suggestion list that was provided by Van Dale 
Lexicography. 

A baseline backed-off trigram language model was created using an initial version 
of the pre-processing module without spelling checking and with only a small portion 
of the available data. A 40K vocabulary was used that included those words from the 
top 100K word frequency list for which a manually checked phonetic transcription
was available. The model was trained using version 2 of the CMU-Cambridge Statistical 
Language Model Toolkit [8] using Witten-Bell discounting. With this language model 
and our acoustic model based on read speech, a 34% word error rate (WER) was achieved 
on read speech and 58% WER in the broadcast news domain. 
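
Since word error rate is the main figure of merit quoted here, a minimal sketch of how it is commonly computed may be helpful: the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recogniser output, divided by the number of reference words. The function name and the Dutch example sentences are our own illustration, and we do not claim that this is the scoring tool used in the project.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r_word in enumerate(ref, 1):
        cur = [i]
        for j, h_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (r_word != h_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substitution and one deletion against a six-word reference.
wer = word_error_rate("het journaal begint om acht uur",
                      "het journaal begon om uur")
```

Because insertions are counted as errors too, the WER can exceed 100% for very poor hypotheses.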

3.1 Language Modelling Issues 

Recognition performance is expected to improve when all available data is used. However,
improvements on some typical language modelling issues in the broadcast news
domain are necessary for a further decrease of the word error rate. Training on text data
from domains that match the broadcast news domain, such as text transcripts of broadcast
news, seems to be of vital importance [2]. Although manually generated broadcast
news transcripts are being collected at this moment (for acoustic modelling purposes),
we expect that the size of this collection will not be sufficient to achieve language
models that are significantly better. Therefore, for the time being we rely on the teletext
and auto cues data, which at least fairly match the broadcast news domain. Additional
performance improvement is expected from the special handling of certain linguistic
phenomena as discussed in the next sections.

Compounds. In automatic speech recognition, the goal of lexicon optimisation is to 
construct a lexicon with exactly those words that are most likely to appear in the test 
data. Lexical coverage of a lexicon should be as high as possible to minimise
out-of-vocabulary (OOV) words, which are an important source of error in a speech recognition
system. Experiments as in [15,10] show that every OOV word results in between 1.2
and 2.2 word recognition errors. Therefore, a thorough examination of lexical coverage
in Dutch is essential to optimise the performance of the speech recogniser. In [5] lexical
variety and lexical coverage are compared across languages with the ratio

$$\frac{\#\text{words in the language}}{\#\text{distinct words in the language}}$$

which provides an indication of how difficult it is to obtain a high lexical coverage given
a certain language. In general, the more distinct words there are in a language, the harder
it is to achieve a high lexical coverage. When the ratios of two languages are compared,
the language with the higher ratio has less difficulty in obtaining an optimal lexical
coverage than the other language. In Table 1 the statistics found in [5] are given and
those for Dutch are added (coverage based on the normalised training text). It shows
that Dutch is comparable with German, although the lexical coverage of German is even
poorer than that of Dutch. The reason is that German has case declension
for articles, adjectives and nouns, which dramatically increases the number of distinct
words, while Dutch does not. The major reason for the poor lexical coverage of German
and Dutch compared to the other languages is word compounding [3,4]: words can
(almost) freely be joined together to form new words.
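
The quantities compared in this section can be computed from a tokenised corpus in a few lines. The following sketch is purely illustrative: the function name, the toy token lists and the choice to build the lexicon from the N most frequent training words are our assumptions about a typical set-up, not a description of the tooling used in the project.

```python
from collections import Counter

def lexical_stats(train_tokens, test_tokens, lexicon_size):
    """Return (ratio, coverage, oov_rate) for a frequency-based lexicon.

    ratio    : number of words / number of distinct words in the training text
    coverage : fraction of test tokens found in the top-`lexicon_size` lexicon
    oov_rate : 1 - coverage
    """
    counts = Counter(train_tokens)
    ratio = len(train_tokens) / len(counts)
    lexicon = {word for word, _ in counts.most_common(lexicon_size)}
    covered = sum(1 for word in test_tokens if word in lexicon)
    coverage = covered / len(test_tokens)
    return ratio, coverage, 1.0 - coverage

train = "de kat zit op de mat de hond".split()   # 8 words, 6 distinct
test = ["de", "kat", "de", "vis"]
ratio, coverage, oov = lexical_stats(train, test, lexicon_size=1)
```

With a lexicon of a single word ("de") half of the toy test tokens are covered. In the tables of this section the same computation is carried out with lexicons of 5K to 65K words.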



Table 1. Comparison of languages in terms of number of distinct words, lexical coverage and OOV
rates for different lexicon sizes. PDB stands for Persdatabank, FR for Frankfurter Rundschau.

Language          English   Italian   French     Dutch    German
Corpus            WSJ       Sole 24   Le Monde   PDB      FR

Total nr. words   37,2M     25,7M     37,7M      22M      36M
#distinct words   165K      200K      280K       320K     650K
ratio             225       128       135        69       55
5K coverage       90,6%     88,3%     85,2%      84,6%    82,9%
20K coverage      97,5%     96,3%     94,7%      93%      90,0%
65K coverage      99,6%     99,0%     98,3%      97,5%    95,1%
20K OOV rate      2,5%      3,7%      5,3%       7%       10,0%
65K OOV rate      0,4%      1,0%      1,7%       2,5%     4,9%



Because of compounding in German and in Dutch, a larger lexicon is needed for these
languages to achieve the same lexical coverage as for English. To investigate whether
lexical coverage for Dutch could be improved by de-compounding compound words
into their separate constituents, a de-compounding procedure was created. Since we do
not have tools for a thorough morphological analysis, we used a partial de-compounding
procedure: every word is checked against a 'dictionary' list of 217K frequent compound
words that was provided by Van Dale Lexicography. Every compound is split into
two separate constituents. After a first run, all words are checked again in a second run, to
split compound words that remained because they originally consisted of more than two
constituents. Table 2 shows the number of words and distinct words, the ratio, and the
lexical coverage of the top 20K, 40K and 60K vocabularies based on the complete newspaper
data set before and after applying the de-compounding procedure. As expected,
de-compounding improves lexical coverage of all vocabularies significantly. Note, however,
that de-compounding typically produces more, shorter words. Since shorter words
tend to be recognised with more difficulty than longer words because of a larger acoustic
confusion, and de-compounding shortens precisely the longer compound words, the possible
improvement of recognition performance from decreasing the number of OOVs could
be neutralised to some extent by a growing acoustic confusion.
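
The two-run, dictionary-based procedure described above can be sketched as a simple table look-up. The compound table below is a tiny hypothetical stand-in for the 217K-entry Van Dale list, whose actual format and contents are not given in the paper, and the function name is our own.

```python
def decompound(tokens, compound_table, runs=2):
    """Split every token found in `compound_table` into its two constituents.

    A second run splits constituents that are themselves listed compounds,
    i.e. words that originally consisted of more than two parts.
    """
    for _ in range(runs):
        split_tokens = []
        for token in tokens:
            # Unknown words are kept unchanged.
            split_tokens.extend(compound_table.get(token, (token,)))
        tokens = split_tokens
    return tokens

# Hypothetical entries, for illustration only.
compound_table = {
    "voetbalwedstrijd": ("voetbal", "wedstrijd"),
    "voetbal": ("voet", "bal"),
}
result = decompound(["de", "voetbalwedstrijd"], compound_table)
```

After the first run the example yields "de voetbal wedstrijd", and the second run splits "voetbal" further. The token count grows while the number of distinct word forms can shrink, which is the effect visible in Table 2.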



Table 2. Number of words, distinct words, ratio and lexical coverage of 20K, 40K and 60K
vocabularies based on the original data set and after applying the de-compounding procedure

                 #words        #distinct words   ratio    20K     40K     60K

Original         146.564.949   933.297           157.04   92.90   95.74   96.99
De-compounded    149.121.805   739.304           201.71   93.92   96.59   97.69



Proper Names and Acronyms. Proper names and acronyms deserve special attention in
speech recognition development for Spoken Document Retrieval. They are important,
information-carrying words but, especially in the broadcast news domain, they are also often
out-of-vocabulary and therefore a major source of error. In general, proper names and
acronyms are selected for the vocabulary like any other word, according to their
frequencies in the development data. Following this procedure, almost 28% of our 65K lexicon
consists of proper names and acronyms. We did a few experiments to see how well
frequency statistics can model the occurrence of proper names and acronyms. Given a 65K
lexicon based on the Persdatabank2000 data set (22M words), we removed different
amounts of proper names and acronyms according to a decision criterion and replaced
them by words from the overall word frequency list, thus creating new 65K lexicons.
To measure lexical coverage and OOV rates, we took the training data itself and, since
we do not have accurate transcriptions of broadcast news, a test set of 35000 words of
teletext subtitling information from January 2001 broadcast news, as a rough estimate of
the actual transcriptions of broadcast news shows. In Table 3, lexical coverage and OOV
rates of these lexicons are listed. It shows that selecting proper names and acronyms like
any other word, according to their frequencies in the development data, works very well,
with a lexical coverage of 97,5%. The next step was removing a part of the proper names
and acronyms and replacing them by normal words from the word frequency list. Neither
selecting only 15% instead of 27,9% proper names and acronyms, nor selecting only
those with a frequency of occurrence of at least N (where N was 100, 500 or 1000), nor
removing all proper names and acronyms improved lexical coverage.

Table 3. Lexical coverage and OOV rates of 65K lexicons created with different amounts of proper
names from different sources. PDB stands for the Persdatabank2000 subset, N means frequency
of occurrence in the training data.

training data   amount of proper names & acronyms   lexical coverage   OOV

PDB             27,9%                               97,5%              2,5%
PDB             0%                                  91,4%              8,7%
PDB             15%                                 97,2%              2,8%
PDB             3,8% (N > 100)                      96%                4,1%
PDB             0,7% (N > 500)                      94,2%              5,9%
PDB             0,3% (N > 1000)                     93,3%              6,7%

Historical Data. For the processing of the historical data aimed at in the ECHO project,
specific language models have to be created. The old-fashioned manner of speaking,
with bombastic language and typical grammatical constructions, puts specific demands
on the language model. Also, the frequent occurrence of names and normal words that
are rarely used anymore requires special measures. Domain or time specific text data
is needed to create both a lexicon and a language model that adequately fit the data.
However, there is only a very small amount of text data available that is related to the
test data in the collection. If there is any, it is available on paper only. A few attempts to
convert this paper data into computer-readable text using OCR (Fine-Reader 5.0) failed:
the pages are generally copies of carbon copies, so the characters are too blurred
for OCR to be successful.



4 Summary 

In summary, we have described the advances in the development of a speech recognition 
system to be used in the MMIR environment for Dutch video archives which started 
in 1998. The work carried out for the acoustic modelling was presented and some lan- 
guage characteristics of Dutch along with the related work on language modelling were 
addressed. Additional development tasks have been identified to improve the current 
performance level. Especially the historical data of the ECHO project turn out to impose 
a lot of additional requirements. 

A very first SDR demonstrator that uses the current speech recognition configuration
can be viewed at http://dis.tpd.tno.nl/druid/public/demos.html

Abstract. In order to improve patients’ living conditions and to reduce the costs of 
long hospitalization, medicine is increasingly interested in telemonitoring 
techniques. These allow elderly people or high-risk patients to stay 
at home and to benefit from remote, automatic medical supervision. In 
collaboration with the TIMC-IMAG laboratory, we are developing a telemonitoring system 
for a habitat equipped with physiological sensors, position encoders for the person, 
and microphones. The originality of our approach consists in replacing video 
camera monitoring, which patients hardly accept, with microphones recording the 
sounds (speech or noises) in the apartment. The microphones form a multichannel 
sound acquisition system which, by coupling the sound information 
with physical information, will enable us to identify a situation of distress. We 
describe the practical solutions chosen for the acquisition system and the recorded 
corpus of situations. 



1 Introduction 

Telemedicine combines electronic monitoring techniques with computer 
“intelligence” and with the speed of telecommunications (established either through 
network or radio connections). Telemedicine is announced as a significant reform of 
medical care because it improves the response time of specialists, who can 
be informed about a medical emergency and react as soon as the first symptoms appear, 
without loss of time. Telemedicine also allows a significant reduction in public health 
costs by avoiding the hospitalization of patients for long periods of time. 

The system we are working on is designed to monitor elderly persons. Its main goal is to 
detect serious accidents such as falls or fainting (which can be characterized by a long idle 
period of the signals) anywhere in the apartment [5]. This technique allows the medical 
center to analyze the information gathered by the telemonitoring system and to intervene 
if needed [2]. We noted that the elderly had difficulty accepting monitoring by video 
camera, because they considered constant recording a violation of their 
privacy. Thus, the originality of our approach consists in replacing the video camera with 
a multichannel sound acquisition system that analyzes in real time the sound 
environment of the apartment in order to detect abnormal noises (falls of objects or of 
the patient), calls for help or moans which could characterize a situation of distress in the 
habitat. 

2 Presentation of the Telemonitoring System 

The habitat we used for experiments is a 30 m² apartment situated in the TIMC laboratory 
buildings, at the Michalon hospital of Grenoble. 

The patient carries a set of sensors which give information about his activity: vertical 
(standing) or horizontal (lying) position and sudden movement (falling). Infrared localization 
sensors are installed in each part of the apartment in order to 
establish where the person is at any moment. These sensors communicate with the 
acquisition system by radio waves and via a CAN bus. The activity sensors 
are controlled by a PC running monitoring software written in Java. 



The sound sensors are 8 microphones, whose positions are given in 
Figure 1. An acoustic antenna composed of 4 microphones monitors both the 
living room and the bedroom, in order to localize the patient inside the two rooms and 
to analyze the sounds. As all the other rooms are much smaller, one microphone 
per room is sufficient. The microphones used are omnidirectional, condenser type, of 
small size and low cost. A signal conditioning card, consisting of an amplifier and an 
anti-aliasing filter, is associated with each microphone. The acquisition system consists of 
a National Instruments PCI 6034E multichannel acquisition card installed in 
a second computer. The acquisition is made at a sampling rate of 16 kHz, a frequency 
commonly used in speech applications. 





(Legend: M1-8 = microphones; infrared sensors.) 



Fig. 1. Position of the sensors inside the apartment 

We programmed all of the software which controls the acquisition under LabWindows/CVI 
from National Instruments. After digitization, the sound data is saved in real 
time in a temporary file on the hard disk of the host PC. The two computers are 
connected to each other by a conventional IP network. The general outline of the acquisition 
system is presented in Figure 2. 

Fig. 2. Diagram of telemonitoring system 



3 Acoustic Localization of the Person 

The exact geometrical location of the person in the apartment is important, both to 
establish whether a suspect noise indicates distress and to indicate 
the location of the person to the emergency services in the event of an alarm. To locate 
the person, we analyze the information from the infrared sensors and the sound sensors. 
Combining the information obtained from the two sources is necessary to 
guarantee a better characterization of the patient’s situation and to avoid false alarms. 

The 4 microphones M1 to M4, located in the small rooms of the apartment (hall, 
toilet, shower-room and kitchen), are sufficient to evaluate the location of the noise or 
speech source by comparing the noise level at each microphone. In contrast, 
for the larger spaces (the living room and the bedroom), we use a wall-mounted acoustic antenna 
composed of 4 microphones forming a 320 × 170 mm rectangle. After evaluating several 
available types (square, T, parallelogram, cross), we chose for our acoustic antenna the 
shape which identifies most exactly the spatial position of the sound source for 
the needs of our application [7]. Using our antenna and a 16 kHz sampling rate, we 
obtain a precision of 0.2 m for each of the space coordinates x, y, z, which is sufficient for 
our application. 
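
The level comparison used for the single-microphone rooms can be illustrated with a minimal Python sketch. It is not the implementation used in the system: the assignment of M1 to M4 to particular rooms, the function names and the use of the RMS value as the "noise level" are assumptions made for illustration only.

```python
import math

# Assumed (hypothetical) assignment of microphones to the four small rooms.
ROOM_OF_MIC = {"M1": "hall", "M2": "toilet", "M3": "shower-room", "M4": "kitchen"}


def rms(samples):
    """Root-mean-square level of one block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_room(blocks, room_of_mic=ROOM_OF_MIC):
    """Return (room, levels) for the microphone with the highest RMS level.

    blocks maps a microphone name to the samples recorded on that channel
    during the same time interval.
    """
    levels = {mic: rms(x) for mic, x in blocks.items() if mic in room_of_mic}
    best_mic = max(levels, key=levels.get)
    return room_of_mic[best_mic], levels


if __name__ == "__main__":
    blocks = {
        "M1": [0.01, -0.01] * 100,
        "M2": [0.0] * 200,
        "M3": [0.0] * 200,
        "M4": [0.5, -0.5] * 100,
    }
    print(loudest_room(blocks)[0])
```

Such a comparison only makes sense if all channels have been brought to the same sensitivity beforehand.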



4 Corpus 

In order to validate our data fusion analysis strategies, we recorded a multichannel 
corpus consisting of a simultaneous and synchronised recording of the 8 signals (WAV 
files), the geometrical coordinates of position, position information from the infrared 
sensors, and data from the sensors carried by the patient. Files are recorded according 
to the SAM standard used for speech corpora. 

The corpus is made up of a set of 20 scripts, each reproducing a string of events: either 
voluntary actions of the patient inside the apartment, or unexpected events which could 
characterize an abnormal or distress situation. The voluntary actions are movements of the 
patient inside the apartment and everyday gestures accompanied by natural speech signals, 
supplemented by other usual sounds (radio, telephone ring, dish noise, door noise, etc.). 
The unexpected events are abnormal noises (falling, glass breaking) and abnormal speech 
signals such as moans, cries and calls for help. Every script is recorded 4 times. For this first study, 
we limit the number of subjects to 5. We present two scripts written in XML: 

<Script no. 1> 
<description> Normal situation (no alarm detected) </description> 
<time> 0 </time> 
<Position> Kitchen </Position> 
<Action> Particular dish noises </Action> 
<time> Event 1 </time> 
<Position> Living room </Position> 
<Action> Phone ring </Action> 
<time> Event 2 </time> 
<Action> Person moving from kitchen to living room </Action> 
<time> Event 3 </time> 
<Position> Living room </Position> 
<Action> Person speaking </Action> 
</Script no. 1> 

<Script no. 2> 
<description> Distress situation (alarm detected) </description> 
<time> 0 </time> 
<Position> Bedroom </Position> 
<Action> Person moving inside the room (gets out of bed) </Action> 
<time> Event 1 </time> 
<Action> Person moving from bedroom to living room </Action> 
<time> Event 2 </time> 
<Position> Living room </Position> 
<Action> Noise of a person falling </Action> 
<time> Event 3 </time> 
<Position> Living room </Position> 
<Action> Moans </Action> 
<time> Event 4 </time> 
<Position> Living room </Position> 
<Action> Long silence </Action> 
<ALARM> Alerting the emergency services </ALARM> 
</Script no. 2> 
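
For automatic processing, such a script could be stored as well-formed XML and read back with a standard parser. The following Python sketch shows one possible encoding of script no. 2. The element and attribute names (script, event, time, position, action, alarm) and the function load_script are assumptions introduced here and are not the format used for the corpus.

```python
import xml.etree.ElementTree as ET

# One possible well-formed encoding of script no. 2 (names are assumed).
SCRIPT_2 = """
<script no="2">
  <description>Distress situation (alarm detected)</description>
  <event time="0">
    <position>Bedroom</position>
    <action>Person moving inside the room (gets out of bed)</action>
  </event>
  <event time="Event 1">
    <action>Person moving from bedroom to living room</action>
  </event>
  <event time="Event 2">
    <position>Living room</position>
    <action>Noise of a person falling</action>
  </event>
  <event time="Event 3">
    <position>Living room</position>
    <action>Moans</action>
  </event>
  <event time="Event 4">
    <position>Living room</position>
    <action>Long silence</action>
  </event>
  <alarm>Alerting the emergency services</alarm>
</script>
"""


def load_script(xml_text):
    """Parse one script into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    events = []
    for ev in root.findall("event"):
        events.append({
            "time": ev.get("time"),
            "position": ev.findtext("position"),  # None if no position is given
            "action": ev.findtext("action"),
        })
    return {
        "no": root.get("no"),
        "description": root.findtext("description"),
        "events": events,
        "alarm": root.findtext("alarm"),  # None for a normal situation
    }


if __name__ == "__main__":
    script = load_script(SCRIPT_2)
    for event in script["events"]:
        print(event["time"], "|", event["position"], "|", event["action"])
```

Grouping each time stamp with its position and action in a single element keeps the events unambiguous when a position is omitted, as in the second event.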

The first step before recording the scripts of the corpus is the calibration of the 
sound recording system. A 1 kHz rectangular signal is played through a loudspeaker placed 
1 meter from each microphone. The gain of each sound channel is then adjusted 
to obtain the same signal amplitude. After recording, the corpus is analysed: every 
event of the script (nature and time) is characterised, and the speech is labelled following 
the usual procedures for speech corpora. 
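
The gain adjustment of the calibration step can be expressed numerically. The sketch below is only an illustration in Python: it assumes that one recording of the calibration tone is available per channel and computes a digital correction factor, whereas in the system described above the gains may well be set on the conditioning cards. The function names are hypothetical.

```python
def peak_amplitude(samples):
    """Largest absolute sample value of a recording."""
    return max(abs(s) for s in samples)


def calibration_gains(tone_recordings):
    """Gain per channel that equalises the amplitude of the calibration tone.

    tone_recordings maps a channel name to the samples recorded while the
    1 kHz tone was played 1 m from that microphone. The loudest channel is
    taken as the reference (an assumption of this sketch).
    """
    peaks = {ch: peak_amplitude(x) for ch, x in tone_recordings.items()}
    if min(peaks.values()) <= 0.0:
        raise ValueError("a channel recorded no calibration signal")
    target = max(peaks.values())
    return {ch: target / p for ch, p in peaks.items()}


if __name__ == "__main__":
    print(calibration_gains({"M1": [0.5, -0.5, 0.5], "M2": [0.25, -0.25, 0.25]}))
```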

Figure 3 shows the panel of the data acquisition software. The left part (“audio” part) 
of the panel gives the localisation in three dimensions of the person in the living room, 
three selected signals from the microphones and their corresponding energies. The right 
part (“infrared” part) of the panel shows a diagram of the apartment on which lamp-type 
indicators are superimposed, giving the position of the person as localized by the 
infrared sensors. Below, it gives the evolution in time of the (binary) signals from 
the same sensors. 

The energy of the speech signals was calculated using a 1-second average in order to obtain 
the location of the patient inside the room. These first results show that the location 
information given by the microphones (signal energy) is similar to the information 
given by the infrared sensors. For example, the signal energy recorded in the hall 
(Figure 3) shows several maxima corresponding to the infrared sensor “Entry” (at 
the beginning, t ≈ 0 s) and to the two hall infrared peaks recorded at approximately 3.5 s 
and 12.5 s. We note the same correspondence between the energy of the 
signals and the infrared sensor in the shower (between 5 and 11 s). Thus, 
we have two kinds of complementary and coherent information, which we shall use in our analysis 
by data fusion. 
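
A 1-second energy average of this kind can be computed as follows. This is a minimal Python sketch under the assumption of consecutive, non-overlapping windows and of the mean squared amplitude as energy measure; the text does not state which windowing was actually used, and the function name is hypothetical.

```python
def energy_per_second(samples, sample_rate=16000):
    """Mean squared amplitude over consecutive, non-overlapping 1 s windows.

    A trailing window shorter than one second is ignored.
    """
    energies = []
    for start in range(0, len(samples) - sample_rate + 1, sample_rate):
        window = samples[start:start + sample_rate]
        energies.append(sum(s * s for s in window) / sample_rate)
    return energies


if __name__ == "__main__":
    # One second of silence, one second of activity, half a second of silence.
    signal = [0.0] * 16000 + [0.5] * 16000 + [0.0] * 8000
    print(energy_per_second(signal))
```

The resulting sequence has one value per second and can therefore be compared directly with the time course of the binary infrared signals.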

The corpus will be used to develop a sound signal characterisation module. 
It has two goals: first, it has to establish whether the audio signal is a speech signal or 
a noise; then it has to characterise the noises (everyday life noises, or abnormal noises, 
e.g. falling) and the speech sounds (normal, cry for help or moans). This module is about 
to be implemented and validated. 

To distinguish between speech and noise, we shall compare the periodic characteristics 
of the speech signal (such as the pitch) with the spectral characteristics (broad 
spectrum and impulsive character) of the noises. The characterization of noises will be 
performed by a noise recognition system based on a classical HMM method [4]. Speech 
characterization (normal or stressed) will be inspired by speaker characterization methods 
[1], [6]. If sounds produced by the speaker do not match normal speech, the system 
will consider that the speaker produces abnormal speech, such as calls for help, cries 
or moans. 
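
The periodicity cue mentioned above can be illustrated with a simple measure. The following Python sketch is not the planned module: it assumes that the peak of the normalised autocorrelation within a plausible pitch range (60 to 400 Hz) is used as periodicity measure and that a threshold of 0.5 separates periodic from non-periodic frames. Both values, and the function names, are illustrative assumptions, and the measure only responds to voiced speech.

```python
import math
import random


def periodicity(frame, sample_rate=16000, f_min=60.0, f_max=400.0):
    """Peak of the normalised autocorrelation within the pitch lag range."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        best = max(best, r)
    return best


def looks_periodic(frame, threshold=0.5):
    """True if the frame shows a clear pitch-like periodicity."""
    return periodicity(frame) >= threshold


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 200 * i / 16000) for i in range(1024)]
    random.seed(0)
    noise = [random.uniform(-1.0, 1.0) for _ in range(1024)]
    print(looks_periodic(tone), looks_periodic(noise))
```

A frame of 1024 samples (64 ms at 16 kHz) is long enough to contain several periods of the lowest pitch considered.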

Fig. 3. The panel of data acquisition software 

If speech is considered to be normal, a word-spotting recognition system 
will recognize about 20 words chosen from a specialized vocabulary of help calls 
(help, aie!, etc.), in order to help the emergency services decide whether to intervene. 
However, the recognition system will not perform large-vocabulary continuous speech 
recognition, as we consider that the semantic content of the phrases uttered by the speaker 
in normal situations concerns private life and should not be recorded. 

5 Conclusions and Perspectives 

We presented a system of telemonitoring for an apartment equipped with audio and infra- 
red sensors. It is meant to replace video cameras as the patients are not very comfortable 
with them. The hardware of the system was set up and was validated. We started to 
record a corpus of scenarios. 

The first results enable us to locate the person in three dimensions in the living room 
or to indicate the room where the patient is located (following the energy of the speech 
signals). The current recordings also allowed us to validate the sensors and the audio 
acquisition system. Information from the audio sensors is complemented by information 
from the infrared sensors to ensure the best results. 

We shall exploit the corpus of everyday life situations and distress situations in order 
to continue our research. We shall study and validate a set of monitoring strategies, based 
on data fusion analysis mixing information from the activity sensors with information 
obtained by sound environment analysis. 

This study is a part of RESIDE-HIS, a project financed by IMAG Institute in collab- 
oration with TIMC laboratory. 

Abstract. Our paper describes a two-pass approach to Czech 
speech recognition. We perform an automatic vocabulary adaptation after the first 
recognition pass, then we build a new language model using the adapted vocabulary, 
and finally we run a second recognition pass. The aim of the vocabulary adaptation 
is to decrease the high OOV rate which is typical of speech recognition for 
highly inflectional languages such as Czech, Polish and other Slavic languages. 
Using the vocabulary adaptation algorithm, we managed to decrease the OOV rate 
by 17% and, more importantly, our word accuracy improved by 11%. Therefore 
this approach can be successfully used in time-uncritical recognition tasks. 



1 Introduction 

In recent years, when speech recognition of Czech was in its early stages, the 
main problem was data sparseness. We did not have enough transcribed speech data 
to train the acoustic models, nor enough text data for estimating the parameters of the 
language model. Nowadays, we have several speech corpora with more than 20 hours of 
transcribed speech, and we can easily train very good acoustic models using them. We 
also have a large text corpus for language model training. The problem now lies in 
the decoder’s ability to handle large vocabularies. Since in speech recognition every 
distinct string of letters represents a distinct word, the size of our vocabulary grows very 
rapidly due to the highly inflectional nature of the Czech language. For example, the text 
corpus used for our experiments contains over 650k distinct words, but state-of-the-art 
speech decoders can handle vocabularies only in the range from 20k to 70k words. 
When we use only the 20k most frequent words in our language model vocabulary, the 
out-of-vocabulary (OOV) rate on the test set selected from the speech corpus is around 
13%. This means that we cannot achieve better accuracy than 87% using such a vocabulary, 
not to mention that every OOV word usually causes one or two additional recognition 
errors. The OOV rate computed on the same test set using the full (650k) vocabulary, 
however, is only 0.36%. This means that almost all words from the test set appeared in the 
text training corpus; we just did not include them in the language model vocabulary because 
of the decoder limitations. Our paper deals with this problem using the idea of two-pass 
recognition: after the first recognition pass the vocabulary is automatically adapted, and 
then a second pass is performed. 
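
The OOV rate discussed above is simply the fraction of test tokens that are missing from the recognition vocabulary. A minimal Python sketch, with hypothetical function names and toy tokens in place of the real corpora:

```python
from collections import Counter


def top_k_vocabulary(training_tokens, k):
    """The k most frequent distinct words of the training text."""
    return {word for word, _ in Counter(training_tokens).most_common(k)}


def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are not in the vocabulary."""
    misses = sum(1 for word in test_tokens if word not in vocabulary)
    return misses / len(test_tokens)


if __name__ == "__main__":
    training = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
    vocabulary = top_k_vocabulary(training, k=2)
    rate = oov_rate(["a", "b", "c", "a"], vocabulary)
    # Every OOV token is necessarily misrecognised, so 1 - rate bounds
    # the achievable word accuracy from above.
    print(rate, 1.0 - rate)
```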

2 Speech and Text Corpora 

2.1 UWB_S01 Corpus 

UWB_S01 [2] is a read-speech corpus consisting of speech from 100 speakers. Each 
speaker read a set of 150 sentences that were selected from the economic sections of 
3 Czech newspapers. This set is divided into two subsets: 110 so-called 
“training” sentences and 40 adaptation sentences. The adaptation sentences are intended 
for speaker adaptation, and the same 40 adaptation sentences were read by all speakers. 
The “training” sentences, on the other hand, are different 
for each speaker. 

The corpus is designed in such a way that it contains as many distinct triphones as 
possible. 

2.2 Text Corpus 

For the language modeling purposes we collected texts from the newspapers Lidove 
Noviny [5] spanning the period 1991 through 1995. The corpus contains approximately 
33 million tokens (650k distinct words). Text data had to go through several preprocessing 
stages in order to obtain clear and unambiguous data. 

3 Two-Pass Recognition 

3.1 First Recognition Pass 

The acoustic models were trained on the UWB_S01 speech corpus, using only the “training” 
subset from each speaker, i.e. 11,000 sentences yielding about 20 hours of transcribed 
speech. Since we did not employ any speaker adaptation technique, we used the adaptation 
subset of the corpus as a test set. Acoustic training was carried out using HTK, 
the hidden Markov model toolkit [6]. 

The language model vocabulary used for the first recognition pass contains the 20k 
most frequent words from the Lidove Noviny corpus. A bigram language model with 
Katz discounting was estimated using the SRILM toolkit [3]. Then we performed the 
first recognition pass, in which we created word lattices for all test utterances. The AT&T 
decoder [7] was used for this purpose. We also found the best paths through the lattices in 
order to obtain the baseline recognition accuracy. The baseline results are summarized 
in Table 1. 

3.2 Vocabulary Adaptation 

The basic idea of the vocabulary adaptation rests on the assumption that our acoustic 
models are well-trained and reliable and that the major source of recognition errors 
is the OOV (unknown) words. If this assumption is correct, the acoustic models replace 
the unknown word with the word which is acoustically most similar to it. In many cases 
such a similar word has the same stem as the correct word and only its ending is different. 
So the main point of the following algorithm is to add to the vocabulary words that have 
the same stem as the words from the lattice, while keeping the original vocabulary size. 

Table 1. First pass results. 

Vocabulary size   OOV rate on the test set   Test set perplexity   Word accuracy
20k               12.86%                     521.28                28.93%

Algorithm of the vocabulary adaptation: 

1. Extract words from all test set lattices and save them to a lattice vocabulary. 

2. Use the Czech morphological analyzer [4] for splitting all words from the lattice 
vocabulary into stems and endings and create the list of all stems that appeared in 
the lattices (lattice stem list). 

3. Adapt the original language model vocabulary used in the first recognition pass: 
replace words that did not show up in the lattices with words from the full text 
corpus vocabulary which did not appear in the original language model vocabulary 
and whose stems are in the lattice stem list. 
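
The three steps above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: replacement words are chosen in order of corpus frequency, original words are retained where no replacement is available, and a crude suffix stripper (toy_stem) stands in for the Czech morphological analyzer [4]. All function and variable names are hypothetical.

```python
def toy_stem(word, endings=("a", "y", "u", "e")):
    """Crude stand-in for a morphological analyzer: strip one known ending."""
    for ending in endings:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def adapt_vocabulary(first_pass_vocab, lattice_words, corpus_words_by_freq,
                     stem_of=toy_stem):
    """Return an adapted vocabulary of the same size as first_pass_vocab.

    first_pass_vocab     -- set of words used in the first pass
    lattice_words        -- set of words extracted from the lattices (step 1)
    corpus_words_by_freq -- all corpus words, most frequent first
    """
    # Step 2: stems that appeared in the lattices.
    lattice_stems = {stem_of(w) for w in lattice_words}

    # Step 3: original words seen in the lattices are kept ...
    kept = first_pass_vocab & lattice_words
    # ... the remaining original words are candidates for replacement ...
    unused = [w for w in corpus_words_by_freq
              if w in first_pass_vocab and w not in kept]
    # ... by new words that share a stem with some lattice word.
    candidates = [w for w in corpus_words_by_freq
                  if w not in first_pass_vocab and stem_of(w) in lattice_stems]

    new_words = candidates[: len(unused)]
    # Keep the most frequent unused words if there are too few candidates.
    survivors = unused[: len(unused) - len(new_words)]
    return kept | set(new_words) | set(survivors)


if __name__ == "__main__":
    corpus = ["je", "na", "banka", "trh", "rok",
              "banky", "bance", "trhu", "roku", "pes"]
    first_pass = set(corpus[:5])
    lattice = {"je", "banka", "trh"}
    print(sorted(adapt_vocabulary(first_pass, lattice, corpus)))
```

In the toy example, the words "na" and "rok" never occur in the lattices and are replaced by "banky" and "trhu", which share their stems with lattice words, so the vocabulary size stays at five.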

3.3 Second Recognition Pass 

The adapted vocabulary is used to build a new language model. This model is then used 
for a second recognition pass. Table 2 shows the results. 



Table 2. Second pass results. 

Vocabulary size   OOV rate on the test set   Test set perplexity   Word accuracy
20k               10.71%                     573.75                32.14%

As can be seen from Table 2, our assumptions were correct. We managed to 
reduce the OOV rate by 17% relative (from 12.86% to 10.71%). The perplexity of the new 
language model is slightly higher, but that was to be expected, since we added less 
frequent words to the language model vocabulary. The most important measure in speech 
recognition, the word accuracy, increased by 11% relative (from 28.93% to 32.14%). 
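
The percentages quoted here are relative changes between the first-pass and second-pass values of Tables 1 and 2. The short Python check below recomputes them; the helper name relative_change is introduced only for this illustration.

```python
def relative_change(before, after):
    """Relative change from `before` to `after` (negative for a reduction)."""
    return (after - before) / before


if __name__ == "__main__":
    oov = relative_change(12.86, 10.71)          # OOV rate, in percent
    accuracy = relative_change(28.93, 32.14)     # word accuracy, in percent
    perplexity = relative_change(521.28, 573.75)
    print(f"OOV rate:      {oov:+.1%}")
    print(f"Word accuracy: {accuracy:+.1%}")
    print(f"Perplexity:    {perplexity:+.1%}")
```

This gives a relative OOV reduction of about 16.7%, a relative accuracy gain of about 11.1% and a perplexity increase of about 10.1%. In absolute terms the OOV rate drops by 2.15 percentage points and the word accuracy rises by 3.21 percentage points.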

This is a very good relative improvement, even though the absolute numbers are not so 
great. But they can be improved by tuning the language model scaling factor and the 
word insertion penalty. We could also expect an even better improvement if we adapted 
the vocabulary for each test utterance separately instead of adapting it for the whole test 
set at once. 

4 Conclusion and Future Work 

This paper described a two-pass approach to the automatic speech recognition of Czech. 
The automatic vocabulary adaptation algorithm proved to be effective and increased 
the word accuracy by 11%. Therefore this approach can be successfully used in 
time-uncritical tasks, where an immediate response is not necessary. We would like to point 
out that all steps of the two-pass recognition can be performed automatically. 

Fig. 1. Scheme of the vocabulary adaptation 



Moreover, the experiments conducted in connection with this paper showed that our 
acoustic models are very reliable. When the recognizer encounters 
an unknown word, it outputs a word or a sequence of words which looks like complete 
nonsense at the surface (word) level, but in fact closely matches the 
correct utterance acoustically. So a good idea for future research would be to output the most 
probable word sequence from the recognizer not in the form of the word transcription, 
but in the form of the underlying phone transcription. Those sequences of phones could 
then be modified using some kind of confusion matrix and thus allowed to form new 
phonetic baseforms that did not show up in the language model pronunciation dictionary, 
but possibly appeared in the full pronunciation dictionary. Such an approach should reduce 
the OOV rate even more than the vocabulary adaptation described in this paper. 

Abstract. A problem of state-of-the-art TTS systems is that the produced speech 
does not always fit the actual speech task. An improvement can be achieved by 
considering speaking styles in synthetic speech. If the desired style deviates only 
slightly from the standard speech of the TTS system, as is the case for different 
kinds of reading styles, it is proposed that the style can be simulated with adapted 
prosody. Therefore this investigation uses data-driven algorithms for prosody generation. 
After training, style simulation is done by switching to the appropriate 
style parameter set. First experiments show that the resulting speech quality is 
increased by the style-adapted prosody. 



1 Introduction 

State-of-the-art TTS systems generate speech of good intelligibility but still lack 
naturalness. This limits user acceptance of synthetic speech. One of the possible 
reasons is that the TTS system does not adapt its output to specific speech tasks, i.e. 
the communication situation, the information to exchange etc., as humans do. This 
problem leads to the consideration of speaking styles in TTS systems. Therefore this 
contribution deals with the simulation of speaking styles in synthetic 
speech. First a definition of the term “Speaking Style” is given, followed by reflections 
on the acoustic-phonetic characteristics of styles. Subsequently a procedure for style 
simulation in state-of-the-art TTS with adapted prosody is proposed. Experiments and 
first results are given. 



2 Definition of Speaking Styles 

According to [10] the term “Speaking Style” can be defined as the way of oral or written 
expression in its specific application. Furthermore it should be said that a style in general 
describes a deviation from a standard and that every style has typical recurrent relatively 
constant characteristics which enable recognition and assignment. 

In the case of speech it is not finally decided what the standard is from which each style 
deviates. In the context of speech synthesis, the standard could therefore in practice be 
assigned to the style the system was originally designed for. That may be read speech. 

Another definition can be found in [3]. From there it should be noted that the 
characteristics of styles are speaker dependent, i.e. different speakers may use different 
ways to express the same style. On the other hand, listeners also vary in their perception 
of styles. Furthermore, the style of speech can change within a conversation. 

Fig. 1. The characteristics of speech, its generation and perception, are influenced by a wide range 
of parameters. Some of them are given here. 

The choice of a speaking style and its specific characteristics in human speech are 
influenced by a wide range of parameters. Figure 1 summarizes some of them according 
to [2] and [3]. 



3 Acoustic-Phonetic Characteristics of Speaking Styles 

A number of investigations of the influence of speaking styles on the characteristics of 
speech have already been carried out. An overview is given in [3]. Since the input to a TTS 
system is commonly a given text, only the phonetic characteristics are considered. The 
carriers of style are segmental and suprasegmental parameters. The most important are: 

Prosodic characteristics: 

- Pitch contour 

- Rhythm 

- Amplitude 

- Distribution and realization of prosodic boundaries 

- Distribution and length of pauses 

- Distribution and realization of accents 

Segmental characteristics: 

- Position and movement of formants 

- Spectral tilt 

- Realization of bursts 

- Schwa elision 

It is notable that each parameter contains cues which indicate a specific style, but 
the style itself is characterized by the specification of all parameters together and their 
interaction [7]. 

4 Proposed Method 

Synthesizing speech of good quality is generally a difficult task. A major improvement 
in segmental quality was achieved by concatenating waveform units as opposed to 
using formant filters. Concatenation technology is currently being improved by using large 
speech corpora or corpora with multiple alternative units instead of small diphone or 
multiphone databases with no alternative units. The generation of natural-sounding 
prosody is still an object of research. 

In section 3 it was shown that speaking styles influence segmental and 
suprasegmental characteristics of speech. The strength of the influence on each parameter 
depends on the style, i.e. the more a certain style deviates from the standard, 
the more the characteristics of the resulting speech deviate from the characteristics of 
standard speech. Since the simulation of styles in TTS is at its first 
stage, only slight deviations from standard speech are intended. Such slight style 
deviations are different kinds of reading styles, as opposed to strong style deviations 
like shouting styles. 

The hypothesis of this work is that slight style deviations can be simulated by 
style-adapted prosody. This hypothesis is supported by observations such as those in [7], where 
spontaneous and read speech are compared. It was found that it is possible to reverse 
the listeners’ judgment about the speech style of an utterance by exchanging the entire 
prosody (pitch, duration and energy) between recordings of two different speaking styles. 

To generate style-adapted prosody, data-driven algorithms shall be used. A change 
in style can then be realized by a switch to another parameter set, i.e. the use of 
style-specific databases for prosody control. This also has the advantage that well-developed 
algorithms can be reused. Figure 2 shows the proposed method. The final aim is that 
with the style-adapted prosody the listeners’ judgment of the resulting speech quality 
improves, since the prosody is appropriate to the text and therefore fits the listeners’ 
expectations. 



5 Data Driven Algorithms 

For the generation of style-adapted prosody, data-driven algorithms should be used, as 
explained in section 4. Accordingly, a multi-level data-driven approach is used for the 
generation of the speech rhythm. Likewise, the pitch contour generation is controlled by 
a data-driven hybrid model. The contribution of this work is to combine these two models 
by applying them to the same data. The models are briefly described in the following 
sections. 

Fig. 2. A TTS system synthesizing different speaking styles by switching to different databases. 

5.1 Rhythm Control with a Data Driven Multi-level Model 

To generate a style adapted speech rhythm a multi-level concept is used. This model 
subdivides the rhythm control into phrase, syllable and phoneme level. The duration 
generation at each level can alternatively be controlled either by rule-based statistical 
methods or by learning algorithms such as neural networks. 

At the upper level, the duration of a prosodic phrase is calculated depending on the 
number of syllables it contains and the type of the prosodic phrase. The duration of each 
syllable is influenced by the number of its phonemes and various phonetic attributes like 
accent and nucleus type. At the lowest level, the length of each phoneme is adapted to 
the already given syllable duration according to Campbell’s elasticity hypothesis [1], 
i.e. a stretching factor (z-score) is iteratively calculated from the given syllable length and 
the means and standard deviations of the phoneme durations. The model is sketched in 
Figure 3 and described in detail in [5]. 
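
The iterative calculation of the stretching factor can be illustrated as follows. This Python sketch assumes the common log-domain formulation of the elasticity hypothesis, in which the duration of phoneme i is exp(mu_i + z * sigma_i) with mu_i and sigma_i the mean and standard deviation of its log-duration, and it finds z by bisection. Whether the model in [5] works on raw or log durations, and which iteration it uses, is not stated here, so these choices and the function name are assumptions.

```python
import math


def stretch_factor(syllable_ms, phoneme_stats, tol=1e-6, max_iter=100):
    """Find z such that the phoneme durations sum to the syllable duration.

    phoneme_stats -- list of (mu, sigma) pairs, the mean and standard
                     deviation of the log-duration (ln ms) of each phoneme
    Returns (z, durations_in_ms).
    """
    def total(z):
        return sum(math.exp(mu + z * sigma) for mu, sigma in phoneme_stats)

    lo, hi = -10.0, 10.0
    if not total(lo) <= syllable_ms <= total(hi):
        raise ValueError("syllable duration outside the reachable range")

    # total(z) grows monotonically with z, so bisection converges.
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if total(mid) < syllable_ms:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break

    z = (lo + hi) / 2.0
    return z, [math.exp(mu + z * sigma) for mu, sigma in phoneme_stats]


if __name__ == "__main__":
    # Three phonemes with mean durations of 60, 90 and 50 ms (toy values).
    stats = [(math.log(60.0), 0.3), (math.log(90.0), 0.4), (math.log(50.0), 0.2)]
    z, durations = stretch_factor(260.0, stats)
    print(round(z, 3), [round(d, 1) for d in durations])
```

Because all phonemes of a syllable share the same z, phonemes with a larger standard deviation absorb more of the lengthening or shortening.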

Fig. 3. A data driven multi-level model for rhythm control in a TTS system. 

5.2 Pitch Control with a Data Driven Hybrid Model 

To control the pitch contour generation, a hybrid data-driven rule-based model is used. Its 
core component is the command-response model proposed by Fujisaki [4]. The input 
parameters to this model are learned and adjusted with a neural network. The advantage 
of this concept is that the high output quality of a well-balanced rule-based model is 
combined with the flexibility of a learning algorithm. 

The input to the entire model is a linguistic-phonetic feature vector describing the 
processed text material. The neural network estimates the Fujisaki commands from 
that information, and the command-response model then calculates the resulting 
pitch contour. The model is shown in Figure 4 and described in more detail in [6]. 
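
For orientation, the command-response part of such a model can be written down compactly. The Python sketch below follows the usual formulation of the Fujisaki model, in which ln F0(t) is the sum of a base value, phrase components and accent components. The constants alpha = 2.0/s, beta = 20.0/s and gamma = 0.9 are typical textbook values and not those of the system described here, and the estimation of the commands by the neural network is not covered.

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, limited to gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds,
                alpha=2.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    phrase_cmds -- list of (onset_time, amplitude)
    accent_cmds -- list of (start_time, end_time, amplitude)
    """
    log_f0 = math.log(base_hz)
    for onset, amplitude in phrase_cmds:
        log_f0 += amplitude * phrase_response(t - onset, alpha)
    for start, end, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - start, beta, gamma)
                               - accent_response(t - end, beta, gamma))
    return math.exp(log_f0)


if __name__ == "__main__":
    phrases = [(0.0, 0.5)]
    accents = [(0.2, 0.5, 0.4)]
    for i in range(0, 11):
        t = i / 10.0
        print(f"{t:.1f} s  {fujisaki_f0(t, 100.0, phrases, accents):6.1f} Hz")
```

Switching to another speaking style then amounts to supplying commands estimated from a different training database, while the response functions stay the same.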



6 Speech Material 

The TTS system used was designed for a reading style of single sentences and 
short stories. The 443 sentences, similar to the PhonDat 1 corpus [9], contain all phoneme 
combinations of the German language. 

As a different kind of reading style, this contribution investigates a news-reading 
style. Newscast recordings from the German radio station “Deutschlandfunk” 
are used as speech material. This introduces the problem of comparing two speakers in 
two different styles. The reason for this procedure is that newscasts read by a 
non-professional speaker in an anechoic chamber would not result in natural-sounding 
radio news recordings. 

The database with the newscast recordings is part of a German speech corpus compiled 
by the Institute of Natural Language Processing at the University of Stuttgart [7]. 
It consists of 72 news stories read by a male speaker. The total recording time includes 48 
minutes of speech covering 13151 syllables. The recordings were made on two different 
days. Recordings from the same day include repetitions of the same news story by the 
same speaker. 

Fig. 4. A hybrid neural network rule-based approach for pitch contour generation in a TTS system. 

The speech material was automatically segmented and labeled, and partly manually
corrected. Pitch contour analysis was done using the Fujisaki model. The algorithm for the
automatic extraction of the Fujisaki model parameters is described in [8].

The single sentences and the news stories corpora differ in many ways. The construction
of the texts alone shows significant differences: the single sentences consist
of 3 to 24 words, with an average of 6 words per sentence, while the news stories consist
of 3 to 32 words, with an average of 16 words per sentence. The content of the two text
types and the communication situation during recording differ even more.

Average syllable duration is 200 ms for the single sentences corpus and 210 ms
for the radio news corpus. This gives both styles an average speech rate of about
5 syllables per second.

Of special interest is that some news stories are repeated by the same speaker. These
recordings are used to test the within-speaker variability. For phoneme durations an
average correlation coefficient of 0.86 was found.

The pitch range for the single sentences corpus is 60 Hz to 210 Hz, with an average value
of 120 Hz. In the radio news corpus a pitch range of 50 Hz to 170 Hz is used, with an
average value of 100 Hz.



7 Tests and Results 



Under test is the quality of the TTS-generated prosody and the resulting speech.
Numerical tests investigate how well the calculated prosody fits the data of human-produced
prosody. Subjective tests evaluate the listeners' judgment of the resulting speech
quality. Hence speech synthesized with conventionally generated prosody, style-adapted
prosody, and prosody mapped from natural speech is compared in listening tests.

With the data-driven hybrid model for pitch generation described in section 5.2, after
training with the radio news data, an RMSE of 18.6 Hz between the original and the
model-generated pitch contour is reached for sentences of the validation set.

Experiments with the multi-level model for rhythm control show that the consistency
between model-generated phoneme durations and the durations of natural speech is
increased if the model parameters are adjusted to the evaluated speech. For example,
adjusting the average phoneme duration and z-score parameters at the lowest rhythm
control level to the news reading style increases the duration correlation by 0.2.
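
The two numerical measures used in this section, the RMSE between pitch contours and the
correlation between duration sequences, can be sketched as follows. This is only a
minimal illustration under the assumption that both sequences are already time-aligned
and of equal length; the function names and the sample values are hypothetical and are
not taken from the evaluation described here.

```python
import numpy as np

def rmse(reference, generated):
    """Root mean square error between two equally long sequences (e.g. F0 in Hz)."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    return float(np.sqrt(np.mean((reference - generated) ** 2)))

def correlation(reference, generated):
    """Pearson correlation coefficient (e.g. between phoneme durations in ms)."""
    return float(np.corrcoef(reference, generated)[0, 1])

# Made-up example values, for illustration only.
f0_error = rmse([100.0, 120.0, 140.0], [103.0, 116.0, 140.0])
duration_corr = correlation([60.0, 85.0, 110.0, 70.0], [58.0, 90.0, 100.0, 75.0])
```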

First listening tests show that with optimally adapted prosody the MOS for synthetic
speech is increased by 0.5 on a scale of 1 to 10 (worst to best). With a diphone synthesizer
and mapped prosody an average MOS of 6.5 was achieved.

8 Conclusion 

First results give evidence for the hypothesis that the speech quality of TTS can be
improved if the prosody generation is adapted to the style of the processed text material.
This means that a speaking style can be simulated with adapted prosody if the style differs
only slightly from the standard style of the TTS system used. Further work is required
to test this hypothesis with data of other speaking styles.
Abstract. Compared to other utterance types, general and special questions are
characterized by greater syntactic and semantic integrity, manifested phonetically
in weak F0 fluctuations within the accent groups, lack of F0 declination and
absence of potential boundaries between syntactic units. Thus the utterance type
seems to be one of the factors which influence the size of the intonation unit.



1 Introduction 

It is well known that intonation means are used for the division of the sound stream
into units of various length - utterances and utterance parts (syntagms) - each of which
forms a unit from the intonational point of view.

The definitions of units delimited by intonation vary from researcher to researcher,
depending on the aspect which is thought of as primary - phonetic, semantic or syntactic,
or their combination.

In the Russian tradition, which follows L. Scherba’s definition, a syntagm is a minimal
intonation unit of speech flow, formed in the process of speech and characterized by
semantic unity [1]. Thus phonetic and semantic aspects are emphasized. Some definitions
concentrate on the phonetic aspects: compare P. Ladefoged’s term “tone-group” (a part
of the sentence over which a particular pattern extends [2]) or M. Gordina’s definition
of a syntagm as “a minimal unit of speech flow characterized by a particular intonation
pattern” [3]. Intonation changes near the end of each syntagm, usually (but not
necessarily) on the last stressed syllable, the tonic syllable, signal its boundary, thus
allowing intonation to realize its delimitative, sentence-forming and other functions.
Melodic and temporal characteristics - pitch and duration - are considered primary in
this respect.

As follows from the above, the length of the syntagm is regulated mostly by semantic
factors. The minimal syntagm is one syllable; little is known about the maximum length
of the syntagm and the factors which may influence it.

The aim of this study* was to find out whether the utterance type can be one of the
factors which influence intonation phrasing.



* The research is supported by the RFBR grant N 01-06-80188 “Phonetic properties of the Russian 
spontaneous speech”. 
2 Material and Experiment Design



24 questions - 12 Yes-No questions and 12 special questions - were recorded from
7 speakers of Russian, thus 168 realizations were studied. These questions were
composed in such a way that the number of words, and consequently of syllables, ranged
from a minimum of 2 words (4 syllables) to a maximum of 14 words (46 syllables).
All Wh-questions began with the word “why”; both Yes-No and special questions were
of a neutral nature.

Intonation curves were obtained with the help of the EDS program, developed at the
Department of Phonetics of Saint Petersburg University in cooperation with Saint
Petersburg University of Telecommunications. The program allows automatic segmentation
of the input signal into F0 periods with manual correction of mistakes of the automatic
segmentation (an option which some other programs, CECIL for example, do not
provide).

The auditory analysis of the recorded material was performed by 19 subjects, thus 
3092 responses were obtained and analyzed. 



3 Special Questions 



For most special questions the focus was, as expected, on the interrogative word. The
F0 on the accented syllable was either falling or rising; in the latter case the F0 drop
immediately followed the rise on the accented syllable.

With the growing number of words in the question, the intonation pattern did not
vary very much: in the part following the nucleus, the F0 remained on the same level or
slightly dropped within the final accented syllable. It should be noted that F0 fluctuations
(if any at all) in the part following the nucleus were realized within a very narrow range
- about 20 Hz. F0 declination was either absent or realized on the final syllable as the
final F0 drop (see Figure 1 as an example).




Fig. 1. F0 changes in a Wh-question having 8 words (24 syllables).

Note also the absence of pauses which would be expected at the boundaries of
grammatical constituents.

Let us now consider how these sentences were perceived by the 19 listeners. The
data (Figure 2 and Figure 3) show that the number of pauses - that is, the number of
intonation units - grows with the number of words, the critical number of words being 9.
Two pauses appeared in questions having 13-14 words. The potential maximum number
of pauses in these questions could be equal to the number of grammatical constituents
plus a pause after the interrogative word, that is 6; however, this pattern was not realized
by any speaker. We may tentatively conclude that the type of the utterance, that is a
special question, could have some influence on the speakers’ realization and listeners’
decisions.







Fig. 2. The number of perceived pauses as a function of the number of words and syllables in 
Wh-questions of varying length. 




Fig. 3. Mean values of the number of perceived pauses as a function of the number of words and 
syllables. 
4 General Questions

A greater tendency towards integration of words in the utterance, resistance to
phrasing, and “cancellation” of potential boundaries was found in the realization of
general questions, in which the number of words ranged from 2 (4 syllables) - He’s
gone? - to 13 (46 syllables) (Did Vladimir Nikolayevich Ivanov and his brother Nick
leave for Nizhny Novgorod yesterday night by the last train?)

The speakers were asked to pronounce them in a very natural manner, every time 
placing the focus on the last word. 

Realization. The F0 curves (Figure 4 and Figure 5) show how the pitch changed within
the sentences with a different number of syllables.

Common features for all realizations independent of their length are the following:

- absence of declination;

- subdued word accents in the part preceding the focus;

- absence of melodic changes at potential boundaries of syntactic units;

- the tendency to avoid pausing, even at the boundaries of syntactic constituents;

- in some cases, a slightly falling (not rising) tone in intonation units considered as
non-final.



Fig. 4. F0 changes in a general question having 7 words (21 syllables).

Fig. 5. F0 changes in a general question having 8 words (25 syllables).

The fluctuations of the F0 within the accent groups in the part of the utterances
preceding the focus do not exceed 20-30 Hz.

The results of the auditory analysis are presented in Figure 6 and Figure 7. 



Fig. 6. The number of perceived pauses as a function of the number of words and syllables in 
general questions of varying length. 




Fig. 7. Mean values of perceived pauses as a function of the number of words in a general question. 



Six words appear to be the critical number here, at which the listeners set apart the
noun phrase (the subject) from the rest of the sentence. The listeners are more unanimous,
though, in their decisions for sentences having 13-14 words (45 syllables). Here again,
the potential maximum number of boundaries between intonation units (5) was not
realized, leaving a single pause as the only necessary one.
5 Conclusion

There are utterances which resist structuring. In the case of special and general questions 
the potential number of pauses signaling the end of one unit and the beginning of the 
other is not generally realized, due to a greater integration of words in these utterance 
types compared to, say, statements, as the speaker preplans the intonation of the whole 
utterance. Phonetically, this integration is shown by absence of declination, lack of word 
accents, or subdued word accents, realized on the stressed syllables in the part preceding 
(for general questions) or following the focus (for special questions), which provides a 
background for a greater melodic contrast on the focus syllable, and also by a greater 
tempo and lack of final lengthening. 

It is generally agreed that there is no syntactic unit that exactly corresponds
to a syntagm (intonation unit). When speaking slowly, a speaker may choose to break
a sentence up into a large number of units. The way in which a speaker breaks up
a sentence depends largely on what that person considers to be the important information
points in the sentence. In questions there is one and only one focus of information. A basic
intonation unit is thus a unit of information rather than a syntactically defined unit. The
results of the study presented here speak in favor of the influence of the utterance type on
intonation phrasing. Thus, questions (both general and special) show greater semantic
and syntactic integrity of words, manifested phonetically in weak F0 fluctuations within
the accent groups and at potential boundaries, and in the lack of F0 declination. The growing
number of words in questions has very little influence on the intonation phrasing,
which shows the speaker’s strategy of preplanning the intonation of the whole utterance,
considered as one information unit.
Abstract. This contribution addresses the application of Bayesian changepoint
detectors (BCD) to the estimation of boundary locations between speech units.
A novel segmentation approach is suggested, based on a family of Bayesian detectors
that use the instantaneous envelope and instantaneous frequency of speech rather than
the waveform itself. This approach does not rely on phonetic models, and
therefore no supervised training is needed. No a priori information about the speech is
required, and thus the approach belongs to the class of blind segmentation methods.
Due to the small error in signal changepoint location, this method
can also be used for tuning boundary locations between phonetic categories
estimated by other segmentation methods. The average bias between the exact boundary
location and its estimate is up to 7 ms for real speech.



1 Introduction 

The problem of detecting and estimating the location of speech discontinuities
(changepoints) has been intensively studied for several decades. A great many methods using
different characteristics for manual and blind segmentation have been developed, e.g.,
[1], [2], [3], [4]. Among the most widely used methods for automatic speech
segmentation is the one based on hidden Markov models (HMM) [5],
though the most promising methods seem to be based on the combination of the Bayesian
approach with an HMM method [6], or on the combination of the Bayesian approach with
rules [7], [8] and discrimination analysis [9]. This contribution deals with the
possibility of using a BCD by itself, without combining it with any other method.
Basic BCD characteristics as well as implementation considerations are discussed. The
results for manual and automatic speech segmentation are summarized and evaluated.
The BCD can be used for the improvement of speech corpora labels as well as for precise
localization of various phonetic categories. The main motivation for this work was
text-to-speech (TTS) inventory acquisition.

This work originates from the study [18], in which it was shown that the autoregressive
BCD (BACD) [10] can be used for detecting abrupt changes in speech. On the other hand, that
study revealed numerical instabilities and the need for rather complicated logical filtration.
On the basis of that study, the conditions for system derivation can be
summarized as follows.
1. Exclusion of extensive model training, which is required if HMM or neural nets are
used.

2. Single changepoint analysis should be performed rather than multiple changepoint 
analysis. 

3. Simple models for speech signal should be used to ensure the numerical stability of 
computations. 

The general piecewise linear model [10], [11] was applied in the construction of
the entire system for speech segmentation. The first condition excludes HMM or neural
net-based systems for speech segmentation. The second condition leads to the need for
signal segmentation, as speech contains many changepoints. If these changepoints have
to be located using single changepoint analysis, the speech must be divided into segments,
and changepoint analysis using the BCD then repeated for each segment. As a consequence,
this approach yields many candidates for one changepoint. Logical filtration is therefore
needed to reach the final estimate of each changepoint location. The
third condition requires the proper choice of BCD type. This will be discussed in the
next section.



2 Bayesian Changepoint Detectors 



Let us assume that one segment of the speech signal can be modeled by the linear
piecewise model [10]

\[
d[n] =
\begin{cases}
\sum_{k=1}^{M} a_k g_k[n] + e[n], & \text{for } n < m \\
\sum_{k=1}^{M} b_k g_k[n] + e[n], & \text{for } n \ge m
\end{cases}
\qquad (1)
\]

where $g_k[n]$ is the value of a time-dependent model function.
A matrix formulation of the signal model (1) can be employed

\[
d = Gb + e, \qquad (2)
\]



where $d$ is an $[N, 1]$ data vector $[x[1], x[2], \ldots, x[N]]^T$, $e$ is an $[N, 1]$
vector of Gaussian noise samples, $G$ is an $[N, M]$ matrix whose columns are the basis
functions evaluated at each point in the time series, and $b$ is an $[M, 1]$ vector with
coefficients $a_k$, $b_k$. The matrix form enables the treatment of all linear (in the
parameters) models in exactly the same way.

Marginalizing the likelihood function of $e[n]$ to exclude the nuisance parameters
$a_k$, $b_k$, and maximizing the likelihood function, leads to the posterior density
for the changepoint $m$ [10]

\[
p(\{m\} \mid d) \propto
\frac{\left[ d^T d - d^T G (G^T G)^{-1} G^T d \right]^{-\frac{N-M}{2}}}
     {\sqrt{\det(G^T G)}}
\qquad (3)
\]



The resulting posterior density function is searched for its maximum. This maximum
determines the position of the changepoint in the given signal segment, a procedure
known as maximum a posteriori (MAP) estimation of the changepoint location.
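
The MAP search can be sketched in a few lines for the simplest case, a step model with
one constant before and one after the changepoint. The sketch below evaluates the
logarithm of the posterior (3) for every candidate position and returns the maximum.
Working in the log domain, the 0-based indexing (m is the first sample of the second
regime), and the function names are choices made for this illustration, not details
taken from the paper.

```python
import numpy as np

def log_posterior(d, G):
    """Logarithm of the posterior (3), up to an additive constant, for one matrix G."""
    N, M = G.shape
    GtG = G.T @ G
    Gtd = G.T @ d
    # d^T d - d^T G (G^T G)^{-1} G^T d: residual energy after the least-squares fit
    resid = d @ d - Gtd @ np.linalg.solve(GtG, Gtd)
    resid = max(resid, np.finfo(float).tiny)   # guard against log(0) for noiseless data
    _, logdet = np.linalg.slogdet(GtG)
    return -0.5 * (N - M) * np.log(resid) - 0.5 * logdet

def step_matrix(N, m):
    """Step-model matrix: column 0 is 1 for n < m, column 1 is 1 for n >= m."""
    G = np.zeros((N, 2))
    G[:m, 0] = 1.0
    G[m:, 1] = 1.0
    return G

def map_changepoint(d):
    """Return the candidate m that maximizes the posterior for the step model."""
    d = np.asarray(d, dtype=float)
    N = len(d)
    candidates = range(1, N)                   # both regimes must be non-empty
    scores = [log_posterior(d, step_matrix(N, m)) for m in candidates]
    return candidates[int(np.argmax(scores))]

# Example: a noisy step from 0 to 1 whose second regime starts at sample 60.
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(60), np.ones(40)]) + 0.05 * rng.normal(size=100)
estimate = map_changepoint(signal)
```

Other detectors of this family would differ only in how the matrix G is built for each
candidate m.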
Now let us focus on the matrix-based detectors and their proper segment models.

BACD Bayesian Autoregressive Changepoint Detector; the segment is modeled by
autoregressive (AR) processes determined by two sets of parameters $a_k$, $b_k$ (see eq. (2)
and [11]).

BSCD Bayesian Step Changepoint Detector; the segment is modeled by two constants $a_0$,
$b_0$, and described by the equation $d = G_S b_S + e$.

BLCD Bayesian Linear Changepoint Detector; the segment is modeled by two linear
functions with parameters $a_0$, $a_1$, $b_0$, $b_1$, and described by the equation
$d = G_L b_L + e$.

BDCD Bayesian Difference Changepoint Detector (BACD of the 1st order); the segment
is modeled by two parameters $a_1$, $b_1$, and described by the equation $d = G_D b_D + e$.

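Matrices $G_S$, $G_L$, and $G_D$ are defined by the corresponding models as follows:

\[
G_S = \begin{bmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix},
\quad
G_L = \begin{bmatrix} 1 & t[1] & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & t[m-1] & 0 & 0 \\ 0 & 0 & 1 & t[m] \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 1 & t[N] \end{bmatrix},
\quad
G_D = \begin{bmatrix} x[0] & 0 \\ \vdots & \vdots \\ x[m-1] & 0 \\ 0 & x[m] \\ \vdots & \vdots \\ 0 & x[N-1] \end{bmatrix}
\]

and the vectors $b_S$, $b_L$, and $b_D$ have the form:

\[
b_S = [a_0, b_0]^T, \quad b_L = [a_0, a_1, b_0, b_1]^T, \quad b_D = [a_1, b_1]^T.
\]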



Let us summarize the conditions for all models. In all cases the result is one position of the
greatest change (given by the MAP) for one segment. We will focus especially on the BSCD,
BLCD, and BDCD detectors, which are the bases of the segmentation approaches suggested
in this paper. The BACD will serve as the reference for the comparison of results.

The segment model for the BACD is composed of two different AR processes with
two different orders. When the orders of both AR processes are precisely determined,
this type of detector gives the best results. Typical characteristics of the BACD can be
summarized as follows. The BACD with higher AR model orders (about 10) is
computationally expensive and can be numerically unstable, especially for longer segments
(about 500 samples for sampling frequency $f_s = 8$ kHz) [12]. The main problem of the BACD is
its sensitivity to the AR model order selection [13], [14], [15], [16] and [17]. These facts
lead to a relatively large standard deviation in the estimate of the changepoint location [18].
Another important property of the BACD is its sensitivity to spectral changes rather
than to energy changes. Thus the BACD estimates transitions between vowels more precisely
than transitions between a vowel and a voiced fricative, a vowel and a voiced occlusion, or
a semivowel and a vowel.

The segment model for the BSCD is composed of two different constants in
noise. Thus the BSCD requires a signal to be modeled by jumps, and it is therefore very
sensitive to changes in the root mean square (RMS) of the signal. Several measures of RMS
were tested. Finally, the instantaneous envelope computed by the Hilbert transform [20]
was used as the input for the BSCD. This solution is very effective and yields the
highest detector sensitivity and precise localization of changes in a speech signal.
The segment model for the BLCD is composed of two different linear functions
in noise. In line with the previous choice of signal parameter for the BSCD (that is,
the instantaneous envelope), the cumulative sum of the instantaneous frequency of the signal
was used as the input for the BLCD. This type of detector is then sensitive to frequency
changes rather than amplitude or RMS changes.
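
The two signal parameterizations mentioned above can be sketched as follows. The
analytic signal is computed here with a plain FFT-based Hilbert transform; the
instantaneous envelope is its magnitude, and the instantaneous frequency is the scaled
derivative of its unwrapped phase. This is a minimal illustration under the assumption of
a discrete-time, real-valued input; the function names are hypothetical and the paper's
own implementation details are not known.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: keep DC, double positive bins, zero negative bins."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0          # Nyquist bin is kept once
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def instantaneous_envelope(x):
    """Magnitude of the analytic signal (a candidate input for the step detector)."""
    return np.abs(analytic_signal(x))

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz from the unwrapped phase of the analytic signal."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)

def cumulative_frequency(x, fs):
    """Cumulative sum of the instantaneous frequency (a candidate input for the BLCD)."""
    return np.cumsum(instantaneous_frequency(x, fs))

# Example: a 500 Hz cosine sampled at 8 kHz has a flat envelope and constant frequency.
fs = 8000
n = np.arange(800)
tone = np.cos(2.0 * np.pi * 500.0 * n / fs)
envelope = instantaneous_envelope(tone)
frequency = instantaneous_frequency(tone, fs)
```

For a signal with constant frequency, the cumulative sum grows linearly, so a frequency
change appears as a change of slope, which is what a piecewise linear model can detect.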

The segment model for the BDCD is composed of two different autoregressive
processes of the 1st order. The input for this detector is the speech waveform itself. This
type of detector is (like the BACD and BLCD) sensitive to frequency changes.



3 Implementation Considerations 

All three detectors (BSCD, BDCD, BLCD) require no model order estimation,
are numerically stable, and are not sensitive to the segment length. Therefore
the method itself imposes no critical choice of segment length and overlap.
Only speech characteristics need to be used to determine the segment length and overlap.
This enables more precise tuning of system parameters to the characteristics of
speech, and the achievement of better results (smaller standard deviation). The speech
signal is processed by the BCD in signal segments of length 100 ms (800 samples,
$f_s = 8$ kHz) with an overlap of 10 ms (80 samples, $f_s = 8$ kHz). The segment length is
determined by a compromise between the lengths of the shortest phonemes (certain
Czech fricatives) and the longest phonemes (vowels). The minimum segment overlap
is given by the speech interval of stationarity. For each segment one candidate is generated
for one possible changepoint. Due to the segment overlap, there are many candidates
for one signal changepoint. These candidates are either taken into account or discarded
according to the following two-step algorithm of logical filtration. First, all candidates are
summed over a sliding window (the length of this sliding window is 1 ms for the BSCD,
5 ms for the BLCD and BDCD). The final sum representing the candidate belonging to
one signal changepoint is placed at the beginning of the sliding window. Second, if the
distance between neighboring candidates is less than 10 ms, then only the first candidate is
taken into account.
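
One possible reading of the two-step logical filtration is sketched below. Candidate
positions are given in samples; in the first step, candidates falling within one window
are summed and the sum is assigned to the window start, and in the second step a
candidate closer than the minimum distance to the previously kept one is discarded. The
data layout, the function name, and the exact window handling are assumptions made for
this illustration, since the text does not specify them.

```python
def logical_filtration(positions, window, min_distance):
    """Two-step filtering of changepoint candidates.

    positions    : candidate positions in samples, one per analyzed segment
    window       : sliding window length in samples (e.g. 8 for 1 ms at 8 kHz)
    min_distance : minimum spacing of kept candidates (e.g. 80 for 10 ms at 8 kHz)
    Returns a list of (position, number_of_occurrences) tuples.
    """
    # Step 1: sum candidates inside one window and place the sum at the window start.
    merged = []
    for p in sorted(positions):
        if merged and p - merged[-1][0] < window:
            merged[-1][1] += 1
        else:
            merged.append([p, 1])

    # Step 2: of neighboring candidates closer than min_distance, keep only the first.
    kept = []
    for p, count in merged:
        if kept and p - kept[-1][0] < min_distance:
            continue
        kept.append((p, count))
    return kept

# Example with made-up candidate positions (in samples at 8 kHz).
boundaries = logical_filtration([100, 101, 103, 150, 400, 402], window=8, min_distance=80)
```

In this reading, the occurrence count attached to each kept boundary could serve as a
simple measure of how many overlapping segments agreed on it.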



4 Results 

Extensive experiments with simulated and real speech signals were performed to verify
the suggested segmentation approaches. One example of the results obtained for all tested
types of detectors is given in Table 1, where standard deviations (STD) are shown for various
phonetic categories. The STD is given in samples for sampling frequency $f_s = 8$ kHz.
Synthetic signals for this case were generated using an AR(3) process with parameters
extracted from real speech. One hundred realizations (each 400 samples long) were
used for the simulation of each boundary type.

The results of the experiments can be summarized as follows. The BSCD shows high
sensitivity to dynamic changes in speech and can be used for spotting almost all
boundaries between phonetic categories except a few types of boundaries, e.g. boundaries
between vowels and between a vowel and a voiced fricative. The BSCD sensitivity to
discontinuities in speech dynamic changes is greater than the sensitivity of the BACD. If the
cepstral difference [19] between two neighboring segments is less than 6 dB, the BACD fails
to locate the changepoint properly [12], a crucial problem when segment length and overlap
are not properly chosen. The BSCD sensitivity to segment length and overlap is lower than
the BACD sensitivity; therefore BSCD results have a lower bias and standard deviation of
estimated changepoint locations than BACD results. The BSCD works properly even
for a cepstral difference of less than 6 dB.

Table 1. Some boundary simulations for testing

Boundary type               deep [dB]   BSCD STD   BLCD STD   BDCD STD   BACD STD
Vowel-Voiced Fricative         1.79        63         38        190         85
Vowel-Voiced Occlusion         2.24        56         50        155         30
Vowel-Vowel                    2.25        63         59        193         99
Semivowel-Vowel                2.53        56         33        193         55
Nasal-Vowel                    2.83        38         29        144         18
Silence-Vowel                  5.79        53         27         98          5
Burst-Vowel                    6.58        21         19          9          5
Burst-Semivowel                9.02        12         19         10          3
Affricate-Vowel               13.14        30          6.8        1.9        1.7
Voiceless Fricative-Vowel     13.27        21          6.7        1.5        1.2


For specific types of boundaries, the BSCD also locates boundaries better
than HMM segmentation; this holds especially for silence-burst, burst-vowel and
burst-semivowel boundaries, that is, for phonetic pairs with relatively large
dynamic changes. The BCD detectors were tested with the aim of correcting the boundary
locations between phonetic units estimated by other methods as well as for automatic
speech segmentation. The bias between the exact boundary location and its estimate by
the BSCD and BLCD detectors is less than 7 ms (60 samples for 8000 Hz) for speech
signals. The average probability of properly determined phonetic boundaries in automatic
segmentation is determined by the occurrence of phoneme types detectable by the BSCD.

Real speech segmentation using the algorithm described in the
preceding section is illustrated in Figure 1. The first part of the figure shows the speech
waveform together with boundaries obtained by manual segmentation. The third and fourth
parts depict boundaries of BSCD and BLCD segmentation, respectively (the y-axis
represents the number of occurrences for each boundary candidate, the maximum is
10 occurrences). It can be seen that the BSCD fails inside the stationary part (its length
is greater than 100 ms) of the vowel “a”. In contrast to the manual segmentation,
the BSCD and BLCD mark the abrupt spectral changes inside the consonant “c”. Other
changepoints correspond very closely to the spectrogram (see the second part of the
figure). The BLCD is sensitive to frequency changes, therefore it gives more boundaries
than expected for the given number of phonemes. That means the BLCD cannot be
recommended for automatic segmentation. While the BSCD boundaries correspond to
RMS changes of speech, the BLCD boundaries correspond to spectrogram changes. Thus
these two types of detectors yield complementary information about speech changes.

Fig. 1. Example of blind BSCD and BLCD segmentation.



The final performance comparison of all described types of detectors is given in 
Table 2. The numbers represent the order of the acceptability for the parameter given in 
the first column (1 - the best, 4 - the worst). 



Table 2. Performance comparison of BCD detectors

Parameter                       BSCD   BLCD   BDCD   BACD
Universality                      1      2      3      4
Computational costs               2      3      1      4
Numerical stability               1      3      2      4
Estimation consistency            1      3      2      4
Reliability below deep = 6 dB     2      1      4      3
Reliability above deep = 6 dB     4      3      2      1


5 Conclusions 

A novel approach based on signal parameterization followed by different types of BCD is
suggested. This approach was verified and optimized by extensive experimentation. Due
to the small margin of error in signal changepoint location, this method can be used for
precise boundary location between phonetic categories estimated by other segmentation
methods. This is the most important result. The approach is general enough to serve as
a base for the development of new combinations of various types of BCD with proper speech
parameterization.
Abstract. The filter bank approach is commonly used in the feature extraction phase
of speech recognition (e.g. Mel frequency cepstral coefficients). The filter bank is
applied to modify the magnitude spectrum according to physiological and
psychological findings. However, since the mechanism of the human auditory system is
not fully understood, the optimal filter bank parameters are not known. This work
presents a method where the filter bank, optimized for discriminability between
phonemes, is derived directly from phonetically labeled speech data using Linear
Discriminant Analysis. This work can be seen as another proof of the fact that
incorporation of psychoacoustic findings into feature extraction can lead to better
recognition performance.



1 Introduction 

Feature extraction is an important part of the speech recognition process, where the input
waveform is processed for the subsequent pattern classification. While classification is usually
based on stochastic approaches where models are trained on data, feature extraction is
generally based on knowledge and beliefs. Current methods of feature extraction are
mostly based on the short-term Fourier spectrum and its changes in time. Auditory-like
modifications inspired by physiological and psychological findings are performed on
the spectrum of each speech frame in the sequence. Mel frequency cepstral coefficients [2] are
commonly used as a feature extraction method in which energies in the spectrum are integrated
by a set of band-limited triangular weighting functions (a filter bank). These weighting
functions are equidistantly distributed over the mel scale according to psychoacoustic
findings, so that better resolution in the spectrum is preserved for lower frequencies than for
higher frequencies. The log of the integrated spectral energies is taken (which corresponds
to human perception of loudness) and finally a projection onto cosine bases is performed.
However, since the mechanism of the human auditory system is not fully understood, the
optimal system for feature extraction is not known. Moreover, psychoacoustic findings often
describe limitations of the human auditory system, and we do not know whether modeling those
limitations is useful for speech recognition.
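
The conventional processing chain described above (triangular filters spaced
equidistantly on the mel scale, logarithm of the integrated energies, projection onto
cosine bases) can be sketched as follows. The mel mapping uses the widely used formula
mel = 2595 log10(1 + f / 700); the number of filters, the FFT size, the sampling
frequency, and the function names are assumptions chosen for this illustration and are
not taken from the experiments of this paper.

```python
import numpy as np

def hz_to_mel(f):
    """Frequency in Hz mapped to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular weighting functions with centers equidistant on the mel scale."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def cepstral_coefficients(power_spectrum, bank, n_ceps):
    """Log of integrated band energies followed by a projection onto cosine bases."""
    log_energy = np.log(bank @ power_spectrum + 1e-10)   # small floor avoids log(0)
    n = len(log_energy)
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n)[None, :]
    cosine_bases = np.cos(np.pi * k * (j + 0.5) / n)
    return cosine_bases @ log_energy

# Example: 23 filters for a 256-point FFT at 8 kHz, applied to a flat spectrum.
bank = mel_filter_bank(23, 256, 8000)
coefficients = cepstral_coefficients(np.ones(129), bank, 13)
```

Each row of the matrix bank is one weighting function; a data-derived filter bank would
replace these fixed triangular rows with rows estimated from labeled speech.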
This work presents a method where the filter bank is derived directly from phonetically
labeled speech data. As a result of this method, we obtain both the frequency warping
and the shape of the individual weighting functions of the filter bank.

2 Linear Discriminant Analysis 

The method is based on Linear Discriminant Analysis (LDA) proposed by Hunt [3]. LDA
is a technique that looks for a linear transform which allows dimension reduction of the
input data while preserving the information important for linear discrimination among
input vectors which belong to different classes. The output of LDA is a set of linearly
independent vectors which are the bases of a linear transform and which are sorted by their
importance for discrimination among different classes. Since we also have information
about the importance of particular base vectors, we can pick only the first several bases, which
preserve almost all the variability in the data important for discriminability. In other
words, the resulting transformation matrix contains only the first several columns of the matrix
obtained by LDA.




Figure 1 demonstrates the effect of LDA for 2-dimensional data vectors which belong to
two classes. The grey and the empty ellipses represent the distributions of data of two
different classes $C_1$ and $C_2$ with mean vectors $m_1$ and $m_2$. The axes X and Y are
the coordinates of the original space. A large overlap of the class distributions can be
seen in both directions of these original coordinates. The axis Z shows the direction
obtained by LDA. The classes are well separated after their projection onto this
direction. Since this example deals with just two classes, and since LDA assumes that the
distributions of all classes are Gaussian with the same covariance matrix, no other
direction giving better discrimination can be obtained.

Base vectors of the LDA transform are given by the eigenvectors of the matrix
$\Sigma_{wc}^{-1} \times \Sigma_{ac}$.

The within-class covariance matrix $\Sigma_{wc}$ represents unwanted variability in the
data and is computed as the weighted mean of the covariance matrices of the classes:
$\Sigma_{wc} = E[\Sigma_p]$, where $\Sigma_p$ is the covariance matrix of a particular
class. The across-class covariance matrix $\Sigma_{ac}$ represents the wanted variability
in the data and is computed as an estimate of the covariance matrix of the class mean
vectors: $\Sigma_{ac} = E[(\mu_p - \mu)(\mu_p - \mu)^T]$, where $\mu_p$ is the mean vector
of a particular class and $\mu$ is the global mean vector.

The eigenvalue associated with an eigenvector represents the amount of variability
(necessary for discriminability) preserved by the projection of the input vectors onto
this particular eigenvector. If LDA is to be used for dimension reduction, only the few
eigenvectors corresponding to the highest eigenvalues are used.
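
The computation described in this section can be sketched in a few lines of NumPy. This is
an illustrative sketch rather than the authors' implementation: the function name
`lda_bases` is hypothetical, class weights are taken to be relative class frequencies, and
the eigenvectors of the product of the inverse within-class matrix and the across-class
matrix are obtained through a Cholesky whitening step (a numerically convenient
equivalent, chosen here as an assumption). The synthetic two-class data mimic the
situation of Figure 1.

```python
import numpy as np

def lda_bases(X, labels):
    """Return (eigenvalues, W); columns of W are LDA base vectors, best first."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mu = X.mean(axis=0)                       # global mean vector
    dim = X.shape[1]
    S_wc = np.zeros((dim, dim))               # within-class covariance
    S_ac = np.zeros((dim, dim))               # across-class covariance
    for c in np.unique(labels):
        Xc = X[labels == c]
        prior = len(Xc) / len(X)              # class weight (assumed: relative frequency)
        mu_c = Xc.mean(axis=0)
        S_wc += prior * np.cov(Xc, rowvar=False, bias=True)
        d = (mu_c - mu)[:, None]
        S_ac += prior * (d @ d.T)
    # Whitening: with S_wc = L L^T, the symmetric matrix A = L^-1 S_ac L^-T has
    # eigenvectors V, and W = L^-T V are eigenvectors of S_wc^-1 S_ac.
    L = np.linalg.cholesky(S_wc)
    A = np.linalg.solve(L, np.linalg.solve(L, S_ac).T)
    A = 0.5 * (A + A.T)                       # remove round-off asymmetry
    evals, V = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1]           # sort by decreasing importance
    W = np.linalg.solve(L.T, V[:, order])
    return evals[order], W

# Two overlapping classes in 2-D, separable along one oblique direction.
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, 500),
               rng.multivariate_normal([1.5, -1.5], cov, 500)])
labels = np.repeat([0, 1], 500)
evals, W = lda_bases(X, labels)
z = X @ W[:, 0]                               # projection onto the first LDA base
```

With only two classes the across-class matrix has rank one, so only the first eigenvalue
is substantially different from zero and a single base vector carries all the
discriminative variability.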

3 Filter Bank Derived from Data 

The filter bank is derived directly from phonetically labeled speech data using the LDA
described in the previous section. In this case, the magnitude Fourier spectra of all
training data frames are used directly for the computation of the across-class and
within-class covariance matrices. In our speech recognition task, we want to distinguish
between different phonemes, so the spectra representing speech frames labeled by the same
phoneme belong to one class. Examples of across-class and within-class covariance matrices
derived in this way from speech data from the TIMIT database are shown in Figure 2. Half
of the symmetric magnitude spectrum (129 points) was used as the vectors for deriving
these covariance matrices.

Figure 3 shows the first 5 LDA spectral bases given by the eigenvectors of the matrix
$\Sigma_{wc}^{-1} \times \Sigma_{ac}$. The eigenvalues in Figure 3a indicate that almost
all the variability in the data important for class separability is preserved by the
projection onto only the first few base vectors. The linear transform can be performed by
multiplying an input vector by a matrix M whose columns are the base vectors. In our case,
we choose only the first 13 base vectors, so the transform matrix M has 129 rows and
13 columns.

3.1 Smoothing of Speech Spectra 

The projection of the magnitude spectrum of one speech frame onto the selected bases
results in a new vector (13 points) which should contain almost the same information for
correct recognition as the original spectrum. Since the base vectors are linearly
independent, it is possible to obtain another transform which projects the reduced vector
back into the original space, i.e. a spectrum 129 points long. This transform is given by
the inverse transform matrix $M^{-1}$. We obtain the final transform by joining
(multiplying) both matrices: $M \times M^{-1}$. This transform projects the magnitude
spectrum onto its smoothed version, from which the information useless for discriminating
among phonemes is removed. Each column of the final transformation matrix represents a
weighting function that integrates a band of frequencies around the point corresponding to
the index of the given column. Every fifth weighting function is shown in Figure 4a. The
resulting weighting functions for the integration of lower frequencies are very narrow
(integrating only several points of the spectrum and preserving more detail), while the
functions integrating higher frequencies are much wider. This corresponds to
psychoacoustic findings about human frequency resolution.

Fig. 2. Across-class and within-class covariance matrix computed from magnitude spectrum
Fig. 3. Basis derived using LDA from magnitude spectrum
Fig. 4. Filter bank and warping derived using LDA
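
The smoothing transform of this section can be illustrated with a small NumPy sketch.
Because M is not square (129 rows, 13 columns), its inverse has to be interpreted; the
sketch assumes that the back-projection consists of the first k rows of the inverse of the
full, square matrix of base vectors. That reading, the function name `smoothing_matrix`
and the toy dimensions are assumptions made for illustration only.

```python
import numpy as np

def smoothing_matrix(W, k):
    """Build the final transform from a full basis W (base vectors in columns).

    M     = first k columns of W            (dim x k, forward transform)
    M_inv = first k rows of inverse of W    (k x dim, back-projection; an assumption)
    """
    M = W[:, :k]
    M_inv = np.linalg.inv(W)[:k, :]
    return M @ M_inv                          # dim x dim smoothing transform

# Toy example: 6-point "spectra" reduced to 2 dimensions and projected back.
rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6))                   # stand-in for the LDA basis matrix
T = smoothing_matrix(W, 2)
x = rng.random(6)                             # one input vector (row-vector convention)
x_smooth = x @ T                              # smoothed version of x
```

Under this interpretation T is a projection: applying it twice changes nothing further,
and column j of T is the weighting function that produces point j of the smoothed
spectrum.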

3.2 Deriving of Filter Bank 

It is also possible to derive the frequency warping by measuring and integrating the
bandwidths (widths) of consecutive weighting functions (Figures 4b and 4c). The smoothed
spectrum can be represented by only some of its samples without losing any information.
This means that we can pick only several weighting functions and project the original
spectrum onto them. Their selection must follow the derived warping. In this way we end up
with a set of weighting functions which is very similar to the commonly used Mel filter
bank (Figure 4d).

4 Limitations of the Method and Conclusions 

Our experience shows that recognizers based on feature extraction inspired by
psychoacoustic findings about nonuniform human frequency resolution can perform better
than those based on the pure short-term Fourier spectrum. This work can be seen as another
piece of evidence that incorporating those psychoacoustic findings into feature extraction
leads to better separability among phonemes in a low-dimensional feature space and also to
better recognition performance. However, the LDA technique assumes that the data belonging
to the individual classes have Gaussian distributions with the same covariance matrix and
that the class means also obey a Gaussian distribution. Of course, this is not true for
magnitude spectra of speech. The quest for the optimal filter bank for speech recognition
is therefore still open.



Abstract. The aim of our effort is to reach higher quality of resulting speech
coded by a very low bit rate (VLBR) segmental coder. In an already existing VLBR
coder [1], we want to improve the determination of acoustic units. Furthermore,
a better analysis-synthesis technique (the Harmonic-Noise Model) is going to be
used for the synthesis part instead of LPCC. The VLBR coder consists of a
recognition system followed by a speech synthesizer. The recognizer identifies
recognition acoustic units (RU), while the synthesizer concatenates synthesis
acoustic units (SU). The two kinds of acoustic units can be identical or
different, and they can be modeled in different ways, such as a Hidden Markov
Model for the RU and a Harmonic-Noise Model for the SU. Both kinds of units are
obtained automatically from a training database of raw speech that does not
contain any transcription. In the original version of the coder [1], the quality
of the synthetic speech was not sufficient for two main reasons: the SU units
were too short and difficult to concatenate, and the synthesis was done using
basic LPCC analysis-synthesis. To remove the first drawback, three methods of
re-segmentation were used. Afterwards, the basic LPCC analysis-synthesis was
replaced by HNM.



1 Introduction 

When we speak of very low bit rate coders, segmental or phonetic vocoders are meant [4].
Only vocoders based on recognition and synthesis are able to limit the bit rate
efficiently. The coder and the decoder share a database of speech units (segments) that
are considered to be representatives of any speech uttered by any speaker. Only the
indices of the representatives and some prosodic information are transmitted by this
coder. Hence, the bit rate of these types of coders can be less than 350 bps. The quality
of this speech coding approach depends on many factors. Among the most important is the
quality of recognition of the speech units, but speech analysis and synthesis are no less
significant. The definition of the speech units also influences the resulting quality of
the coder. In our experiments, the speech units are found automatically (by Automatic
Language Independent Speech Processing (ALISP) tools) before the recognizer is trained.
The fact that we do not need a transcribed and labelled speech database is a great benefit
of this method: the coder can easily be used for languages lacking standard speech
databases. Once a set of speech units is obtained, the units can be used for coding. The
coder consists of a recognizer that acoustically labels the speech and of an encoder of
additional information. In the decoder, synthesis built on the concatenation of examples
from the training corpus is applied in order to obtain the output speech. A technique
based on automatically derived speech units was developed at ENST, ESIEE and VUT-Brno
[1], [2]. However, the quality of the synthesized speech is not sufficient. This paper
reports experiments based on re-segmentation of the original units. The aim of the
re-segmentation is to remove transition noise from the synthesized speech, which is caused
by the concatenation of the chosen representatives in the decoder.

2 Basic Structure of the Coder 

All our experiments are built on the Boston University Radio Speech Corpus [5] database
(DB) collected in 1995. The use of ALISP units for very low bit rate speech coding is
described in more detail in [1], [2]. For the initial segmentation, Temporal Decomposition
(TD) [3] is applied. The created segments are clustered by Vector Quantization (VQ).

Hidden Markov Models (HMMs) [1] are widely used in speech recognition because of their
acoustic modeling capabilities. HMMs were applied only in our first two re-segmentation
experiments. The HMMs are related to the original VQ symbols, so their number is 64. The
number of emitting states per model is fixed to 3. The models are initialized as
left-right without state skipping. We have found that an iterative approach can improve
the acoustic quality of the units. Hence, several generations of models are created.

Units found using HMM or TD+VQ segmentation are referred to as “original” or “short”
units. In the baseline version of the coder, a limited number of representatives is found
in the training data for each unit. When coding unseen speech, the input signal is
labelled by HMM or TD+VQ, and the optimal representative is selected for each detected
unit. The information about the units as well as about the representatives is transmitted
to the decoder, where concatenation synthesis takes place. During this synthesis,
transition noise can appear at the points where representatives were concatenated.

3 New Units 

As mentioned before, the re-segmentation of the original units recognized by HMMs or VQ
is applied in order to decrease the influence of transition noise on the resulting
synthesized speech. The original TD segments, on which VQ or HMMs are afterwards trained,
contain stable parts of speech in their centers. Therefore, the boundaries of these
segments are set in non-stable parts of speech that mostly contain little signal energy.
Hence, in the decoder (where the chosen representatives are concatenated to create the
resulting speech), the representatives are concatenated mostly in parts with low energy,
so that the signal-to-noise ratio is low. One could argue that, instead of re-segmenting
the original units, a new alternative segmentation could have been done at the very
beginning. However, the aim is not only to create units whose boundaries are set in the
stable parts of the speech signal. New, longer units that cover more non-stable parts of
the signal are required. It would be difficult to create them by TD and to train a VQ
codebook or HMMs afterwards. Hence, the re-segmentation of the original units is done
after VQ or HMM recognition.

Fig. 1. Scheme of the whole training process based on ALISP.



3.1 Re-segmentation According to Middle Frames of Original Units 

In this approach, the boundaries of the new units are placed at the centers of the
original ones, as can be seen in Figure 2. Several experiments were done that differ in
the minimal length of the new units, i.e. the minimal number of frames in a created new
unit. The re-segmentation algorithm is as follows. First, the centers of the old units are
found. Then, we move from one center to the next and keep track of the number of frames
passed. If the number of frames between two neighboring centers is less than required, the
second center is not declared a new segment boundary and we move on to the center of the
next old unit. This process is repeated until the required minimal number of frames has
been passed. The re-segmentation starts from the center of the first original unit, so the
“prefix” part of the first original unit is declared an independent new unit. The same
situation arises in the last processed old unit. The name of a new unit consists of the
names of the original units that it covers. Let us suppose that the original label
sequence is: Hs HF H7 Hr Hi ... and a new segment boundary is going to be fixed at the
center of the H7 segment. After the re-segmentation according to this approach, the label
sequence will be: HsHFH7 H7HrHi ...
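
The algorithm of this subsection can be sketched as follows. The sketch is an
interpretation rather than the authors' code: the function name, the representation of a
unit as (label, start frame, end frame) with an exclusive end, the integer center
(start + end) // 2 and the handling of the trailing part as one unit reaching to the end
of the last original unit are all assumptions.

```python
def resegment_by_centers(units, min_len):
    """Re-segment contiguous units at the centers of the original units.

    units:   list of (label, start, end), end exclusive, in temporal order
    min_len: minimal number of frames of a new unit
    """
    centers = [(start + end) // 2 for _, start, end in units]
    boundaries = [centers[0]]                 # always start at the first center
    for c in centers[1:]:
        # Accept a center as a boundary only if enough frames were passed.
        if c - boundaries[-1] >= min_len:
            boundaries.append(c)
    first_start, last_end = units[0][1], units[-1][2]
    edges = [first_start] + boundaries + [last_end]
    new_units = []
    for a, b in zip(edges[:-1], edges[1:]):
        if b <= a:                            # skip empty prefix or suffix
            continue
        # The new name joins the names of all original units it overlaps.
        name = "".join(lab for lab, s, e in units if s < b and e > a)
        new_units.append((name, a, b))
    return new_units

# Five original units of four frames each, using the labels of the example.
units = [("Hs", 0, 4), ("HF", 4, 8), ("H7", 8, 12), ("Hr", 12, 16), ("Hi", 16, 20)]
result = resegment_by_centers(units, min_len=8)
```

With these toy durations the two inner new units are named HsHFH7 and H7HrHi, as in the
example above, and the prefix of the first unit and the remainder of the last unit become
short units of their own.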

3.2 Re-segmentation According to Middle Frames of Middle States of HMMs 

In this approach, new segment boundaries are represented by the center frames of the
middle HMM states of the old original units. The number of emitting states per HMM is
fixed to 3. Each state must contain at least one frame. Hence, the minimal number of
frames in an original unit is 3 as well. If the number is higher, the frames are assigned
to states according to likelihood scores. The resulting segmentation based on this
approach will generally differ from the first one.

Fig. 2. Example of re-segmentation according to middle frames of original units. Minimal
length of new units is 4 frames: a) speech signal with its splitting into the frames,
b) original segmentation recognized by HMMs, c) new re-segmentation.

3.3 Re-segmentation According to Gravity Centers of Original TD-Based Units 

In the last experiment, the segment boundaries are placed at the gravity centers of the
original segments derived by TD. The rationale is that gravity centers are among the most
suitable positions in segments, where spectrally stable parts of speech can be expected.
The re-segmentation cannot be built on the label sequence recognized by HMMs, because this
sequence does not match the label sequence obtained by TD. Hence, the re-estimation by
HMMs is not used. However, as before, not every gravity center of an original segment will
represent a new segment boundary. In the sequence of interpolation functions (IFs), their
width in frames is determined. Only a gravity center which lies in an IF that is wide
enough can represent a segment boundary. The sufficient width of IFs is evaluated
according to an a priori chosen constant. In the sequence of IFs, several consecutive IFs
narrower than required can appear. If only the previous condition were applied, the
distances in frames between new segment boundaries would then be too large, and the new
units would be too long. Therefore, each new unit is constrained in length so that it can
cover fewer original units than another a priori set constant. In this way, we can easily
control the lengths of the new units. An example is given in Figure 3.

4 Representatives for Synthesis 

In our experiments, LPCC and Harmonic-Noise Model (HNM) analysis-synthesis algorithms
were applied. To complete the coder, we need to define the synthesis units that will be
used in the decoder to synthesize the resulting speech. For each unique dictionary unit,
the three longest units from the training data set are kept, so that they are mostly
down-sampled when being converted to shorter segments. The representatives are, of course,
taken from the training units after the re-segmentation. When coding previously unseen
speech, the coding units are first detected using the HMM recognizer (in the first two
methods of re-segmentation) or by TD+VQ (in the third method of re-segmentation). Then,
the stream of recognized units is re-segmented. For each coding unit, the best synthesis
unit (out of 3 representatives) is chosen. The choice is made using the minimum Dynamic
Time Warping (DTW) distance between a representative and an input speech segment. When
selecting the representatives to synthesize previously unseen speech, we easily find that
the coded speech contains some coding units that have no equivalent representative stored
in the DB of representatives (based on the training data set). This is caused by the
re-segmentation of the original units, because the theoretical number of unique units
after the re-segmentation is infinite. A new long unit created by one of the
re-segmentation methods can consist of two, three, or more original units, depending on
the minimal required length. Hence, many re-segmented coding units can appear that have
not been seen in the training data set and for which we do not have any appropriate
representative. Two approaches were therefore developed to solve this problem.

Fig. 3. Illustration of Temporal decomposition and re-segmentation process according to
gravity centers of original units on chosen part of speech: a) speech signal,
b) interpolation functions (solid lines) with positions of gravity centers of each segment
(dotted lines) and segment boundaries of original units (dashed lines), c) original
segmentation and new re-segmentation of speech.
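
The selection of a representative by minimum DTW distance, mentioned in this section, can
be sketched as below. This is a textbook DTW with Euclidean local distance and the three
standard step directions; the normalization by the summed lengths, the function names and
the toy feature vectors are assumptions, and the local constraints actually used in the
coder are not given in the text.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW distance between two sequences of feature vectors.

    a, b: arrays of shape (n_frames, n_features)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)       # accumulated cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def select_representative(segment, representatives):
    """Index of the representative with the minimum DTW distance to segment."""
    distances = [dtw_distance(segment, r) for r in representatives]
    return int(np.argmin(distances))

# Three stored representatives of different lengths and one input segment.
representatives = [np.zeros((5, 2)), np.ones((7, 2)), np.full((4, 2), 2.0)]
segment = np.full((6, 2), 1.1)
best = select_representative(segment, representatives)
```

Because DTW aligns sequences of different lengths, the representatives do not need to have
the same number of frames as the input segment.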



4.1 Seeking the Best Synthesis Unit from Existing Ones 

Instead of a non-existing synthesis unit, an existing one is used. Seeking the
appropriate existing synthesis unit by DTW, or by another method based on searching for
the minimum distance between two segments, would result in a very long search time. Hence,
in our experiments, the replacement unit is sought using Euclidean distances between the
original short units of which the longer ones consist.

Unfortunately, it can happen that no appropriate representative is found by this method.
In this case, the representative is created.

4.2 Creating the Representative for Non-existing Coding Unit 

The representative for a non-existing coding unit can be made from original short
representatives. These short representatives were created from the original units before
the process of re-segmentation. Only the longest synthesis unit from each class (from the
training data set, of course) was chosen. Then, this unit was split into two halves at its
middle frame. Both halves and the entire unit are later used in creating a long
representative.



5 HNM Synthesis 

In the previous experiments [1], [2], [3], LPCC synthesis was used to produce the output
speech. Regardless of the re-segmentation method, LPCC synthesis was largely responsible
for the low quality of the resulting speech (this can be verified by a copy LPC
analysis-synthesis). Therefore, the Harmonic-Noise Model (HNM), which brings much higher
quality of the synthesized speech, was applied in our experiments. The principle of HNM is
described in [7]. First, the pitch is detected for all frames of the analyzed speech.
According to the pitch detection score, the frames are marked as voiced or unvoiced. The
parameters of the noise model are calculated for all frames, whereas the parameters of the
harmonic model are calculated only for voiced frames. The LPCC parameterization is still
used for TD, VQ and HMM recognition, because HNM features are not suitable for it. The
representatives are modeled only by the HNM parameterization.



6 Results 

a. Quality of resulting speech: If the new re-segmented units are short, the probability
of a missing representative for a coding unit is small. Hence, an appropriate
representative will be used almost every time. However, the influence of transition noise
on the resulting speech will not be decreased in any way. In the case of too long
re-segmented units, only a small number of transitions appear in the synthetic speech
obtained by concatenating the resulting sequence of units. However, as the DB used in our
experiments is not large enough, the probability of a missing representative is large.
Hence, the non-existing coding unit has to be replaced by the most suitable existing one
(or created from original representatives, in which case more transition parts appear).
The quality of the resulting speech is then, of course, lower. Therefore, the optimal
lengths of the re-segmented units should be determined according to the desired quality of
the resulting speech and the availability of data.

b. Bit rates: When the re-segmentation methods are applied to the original short units,
the number of units in a coded sentence is always lower than without the re-segmentation.
In spite of this, the bit rate does not necessarily decrease. The re-segmentation greatly
increases the number of unique re-segmented units. Hence, more bits are needed to transmit
the indices of the coding units. It follows that the resulting bit rate depends on the
lengths of the new re-segmented units. In all our experiments, the original prosody as
well as timing are used in the decoder, so that the bit rates refer only to the encoding
of unit and representative selection. A summary of all our experiments with average bit
rates is given in Table 1.
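
The trade-off described in this paragraph (fewer units per second, but more bits per unit
index) can be made concrete with a rough back-of-the-envelope calculation. The model below
is an assumption made here for illustration and is not stated in the paper: each unit
index is taken to cost log2(N) bits, and the number of original units per second is
inferred from the rec5 row of Table 1. The values of N, the relative number of segments
and the reported bit rates are copied from Table 1.

```python
import math

# (version, N unique units, relative number of segments in %, reported bit rate in bps)
TABLE_1 = [
    ("rec5", 64, 100, 129),
    ("nse", 7363, 81, 219),
    ("nse1", 13656, 53, 157),
    ("nse2", 17144, 42, 137),
    ("nseNV", 9748, 65, 205),
    ("nse1NV", 16955, 48, 155),
    ("sps", 19817, 50, 162),
    ("sps1", 20541, 32, 105),
    ("sps2", 19229, 45, 139),
]

def estimated_bit_rate(n_unique, rel_segments_pct, base_units_per_sec):
    """Bit rate under the assumed fixed-length index coding."""
    bits_per_unit = math.log2(n_unique)
    units_per_sec = base_units_per_sec * rel_segments_pct / 100.0
    return bits_per_unit * units_per_sec

# Units per second of the original segmentation, inferred from the rec5 row.
base_rate = 129 / math.log2(64)

estimates = {
    version: estimated_bit_rate(n, rel, base_rate)
    for version, n, rel, _ in TABLE_1
}
for version, n, rel, reported in TABLE_1:
    print(f"{version:8s} estimated {estimates[version]:6.1f} bps, reported {reported} bps")
```

Under these assumptions the estimates stay within roughly ten percent of the reported
values for every row, which is consistent with, but does not confirm, the reading that the
bit rate is dominated by the length of the unit index.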



Table 1. Table of results.

version  const1  const2  bit rate [bps]  N      l [%]  sp. d. [dB]
rec5     -       -       129             64     100    6.46
nse      4       -       219             7363   81     5.51
nse1     7       -       157             13656  53     5.73
nse2     9       -       137             17144  42     5.81
nseNV    4       -       205             9748   65     5.61
nse1NV   7       -       155             16955  48     6.01
sps      8       3       162             19817  50     5.65
sps1     10      4       105             20541  32     6.01
sps2     7       4       139             19229  45     5.80

bit rate [bps]: average bit rates (without prosody, for one representative).
N: number of unique segments in the training data set.
l [%]: average number of segments in the re-segmented label files relative to the rec5
version.
sp. d. [dB]: spectral distortion between the coded version and the original speech (only
for LPC synthesis).
rec5: old segmentation.
nse: re-segmentation according to middle frames of original units (const1 = min. length of
new segments in frames).
nseNV: re-segmentation according to middle states of HMMs (const1 = min. length of new
segments in frames).
sps: re-segmentation according to gravity centers of original units (const1 = min. width
of old units in frames, const2 = max. number of original units that can be covered by a
new unit).



7 Conclusion 

The purpose of applying the re-segmentation techniques was to reach higher quality of the
resulting speech coded by VLBR coders. This aim was achieved in all our experiments.
Subjectively, the best quality of the resulting speech was obtained with the “nse1”
version of the re-segmentation (objectively with the “nse” version, according to Table 1).
Some examples of the resulting speech can be found at:
http://www.fee.vutbr.cz/~motlicek/speech.html.

The speech coded using only the original units (without re-segmentation) and the
resulting average bit rates are given there as well. In our experiments, the prosody and
the timing (DTW path) were not coded. However, according to [6], a sufficient average bit
rate for coding prosody is about 200 bps. With this value, the difference between the
original and the coded prosody is almost indistinguishable. We can therefore expect the
total bit rate for a speaker-dependent coder to be about 370 bps.



Abstract. In the paper we compare a selected collection of Chinese radicals and
their meanings with Top Ontology developed in the framework of EuroWordNet 
1, 2 project (EWN). The main attention is paid to the question whether there are 
some interesting relations between them and if so whether the knowledge about 
them can be employed in building more adequate descriptions of natural language 
semantics. The result shows that Chinese organizes concepts in a very different 
manner from EWN TO. We discuss what potential implications this organization 
may have on the future development of EWN. 



1 Introduction 

The recent development in the area of NLP shows that a number of researchers pay
attention to the problems of lexical meaning in relation to systems for knowledge
representation (e.g. EWN, CyC, HowNet). This research brings interesting results in the
form of formalized lexical databases such as WordNet 1.5 [7] or the multilingual
EuroWordNet 1, 2 [11], which is enriched with the Top Ontology (TO) and the set of Base
Concepts (BC) [10].

Let us briefly describe the nature of TO: it is a hierarchy of language-independent
concepts (though they are embodied in English) reflecting relevant semantic features,
e.g. object, substance, origin, form, composition, dynamic and static, etc. TO consists of
63 fundamental semantic features taken from various semantic theories and paradigms.
Then there is a set of BC containing 1059 items. To handle their meaning, they have been
classified hierarchically according to the TO mentioned above and specifically designed
for the purpose of covering the whole vocabulary of a particular language and also for
comparing them¹.

¹ There are eight languages in the EWN 1, 2 project.



2 Ontologies, the EuroWordNet Top Ontology



In the field of NLP, we can come across artificial constructs called ontologies
that have been developed recently within projects like EuroWordNet 1, 2 [11], or
systems of the CYC type [5], with the primary purpose of serving as lexical bases
for knowledge representation systems. In the current version of the EuroWordNet
TO (v.1), three types of entities are distinguished at the first level of TO (in
the further explanation we follow [10]):

— 1st Order Entity - any concrete entity publicly perceivable by the senses and 
located at any point in time, in a three-dimensional space, e.g. individual 
persons, animals and more or less discrete physical objects and physical 
substances. They are always denoted by (concrete) nouns. 

— 2nd Order Entity - any Static Situation (property, relation) or Dynamic Situ- 
ation, which cannot be grasped, heard, seen, felt as an independent physical 
thing. They occur or take place rather than exist, e.g. continue, occur, apply, 
and also events, processes, states-of-affairs or situations that can be located 
in time belong here. They can be expressed by nouns, verbs and adjectives. 

— 3rd Order Entity - unobservable propositions which exist independently of 
time and space. They can be true or false rather than real. They can be as- 
serted or denied, remembered or forgotten, e.g. ideas, thoughts, theories, 
plans, hypotheses, reasons, and they are always expressed by (abstract) 
nouns. 



3 Chinese Radicals 

It is well known that the Chinese script originated from picture-writing (i.e.
representing objects and concepts encountered in everyday life by pictures).
However, not all Chinese characters² are pictographs. In fact, only a couple of
hundred of them are really pictographs [3]. According to the etymological
dictionary written by Xu Shen around 100 A.D.³, Chinese characters can be divided
into six groups [3, 6, 1]:

1. pictographs (~ 4%): represent real-life objects by drawings, e.g. the ancient
form of 日 (sun) resembles ⊙ and 羊 is a pictograph of a sheep with horns

2. ideographs (~ 1%): represent positional and numeral concepts by indication,
e.g. 一 (one), 二 (two), 三 (three)

3. logical aggregates (~ 13%): form a new meaning by combining the meanings
of two or more characters, e.g. 林 (forest) is formed by putting more than
one 木 (tree) together

4. phonetic complexes (~ 82%): form a character by combining the meaning of
one character and another character which links to the same sound, e.g.
蛾 (moth) is formed by 虫 (insect), the meaning component, and 我 (I/me),
the sound component

5. associative transformations (a small proportion): extend the meaning of a
character by adding more part(s) to the existing one, e.g. (emperor’s court)
was transformed from I* (emperor’s court, courtyard) by adding

6. borrowings (a small proportion): borrow the written form of a character
with the same sound, e.g. 里 (mile) originally means a town/small
district/village

² As the Chinese script evolved in time, the original pictures had been simplified to
‘written graphs’ [6] comprising more or less straight lines. Thus, it is perhaps more
precise to use the term ‘graph’ instead of ‘character’.

³ In this dictionary, Xu Shen included only 9,353 characters.

There are roughly 50,000 characters in the Chinese script, but an average
educated Chinese knows only about 6,000 characters [3]. However, this rather
limited knowledge of the Chinese script does not necessarily hinder a Chinese in
acquiring the meanings of new characters. One reason for this is that many
Chinese characters (especially the more complicated ones) are formed by
combining two or more simpler characters, and at least one of these components
sheds some light on the meaning of the resulting character. Thus, the knowledge
of a few thousand characters allows a Chinese to deduce the meaning of a
previously unseen character⁴. For instance, when a Chinese encounters a set
of ‘new’ characters like: 1 %, 4%, f|, 4|, knowing that means ‘a bird’, 

he/she can deduce that these characters probably mean some kinds of birds (in 
fact, they all are names of birds). 

If we take a look at the list of Chinese characters, we can immediately see
that there is a special collection of graphs⁵ that appear in a prominent position
within many Chinese characters and can be regarded as the basics. These graphs
are called radicals, i.e. 部首 (literally: section head, and also root (radix)).
Though a Chinese character can contain more than one shape which is among
the radicals, each character is grouped under only one radical. It is necessary to
realize that most, if not all, Chinese radicals serve a similar function to
individual words or meaningful morphemes in other natural languages. Thus, to a
certain extent, we can view them as lexemes.

The set of Chinese radicals contains approximately 214 items⁶ and they are
arranged according to the total number of brush strokes required to write them.
In our study, we have classified these radicals along the following lines:

1. radicals denoting basic concepts, e.g.: 人 (human being), 鳥 (bird), 風 (wind),
火 (fire), 水 (water)

2. radicals functioning as verbs, e.g.: 比 (to compare), 用 (to use), 飛 (to fly)

3. radicals used as auxiliaries or particles, e.g.: 又 (again), 而 (yet, moreover,
also), 至 (to, extreme), 非 (not)

4. radicals functioning as numerals, ordinals or units of measure, e.g.: 一 (one),
乙 (a time/grade unit), 二 (two), 八 (eight), 十 (ten), 寸 (inch), 斗 (a dry
measure), 辰 (a unit of time)

⁴ One would expect that this kind of deduction can be made more effective when the
‘new’ character is encountered in a context.

⁵ Some, but not all, of these graphs are valid characters in their own right.

⁶ In many Chinese dictionaries, e.g. [4], [8], [9], the graphs M. and are considered to
be two distinctive radicals (though they look very similar). However, in [12] which is
a well-known and popular Chinese word dictionary, they are considered as the same
radical. Thus, there are only 213 items in this dictionary.

In the present study, we focus on the first group of Chinese 
radicals. We will examine whether there are any interesting relations between the 
concepts in TO and these Chinese radicals. We will also discuss what implications 
this study has for the development of artificial ontologies. 

4 EWN TO versus Chinese Radicals: a Comparison 

In this study, we are using the EuroWordNet (EWN) TO framework described 
by Vossen et al. [10]. We tried to associate each Chinese radical with the corre- 
sponding top concept(s) listed, mainly, in the 1st and 2nd order entities. If such 
associations can be found, it can be observed that the semantic features found in 
the collection of Chinese radicals can be reasonably linked with the expressions 
denoting the concepts in EWN TO. 

For most of the top concepts in EWN TO (e.g. Plant, Human, Creature, Animal, 
etc.), groups of hyponyms can be found amongst the Chinese radicals quite 
straightforwardly (cf. Sections 4.1 & 4.2). When a counterpart cannot be found, 
it can be deduced that such a top concept is designed on a level of abstraction 
that is not immediately captured by a radical, but by more complicated 
character(s), e.g. Software (4t#). However, this is not always the case. Recall that at 
the present stage, we are focusing on the first group of Chinese radicals shown 
in Section 3. We have put aside a set of radicals which might correspond to 
SituationType and/or SituationComponent. This explains the result in Section 
4.2. 

Notice that many Chinese radicals grouped under 1stOrderEntity can be 
classified by more than one top concept. For instance, }} (knife) is associated 
with Artifact, Solid and Instrument. This manifests a distinctive characteristic of 
EWN TO, i.e. to allow a concept to be represented as informatively as possible 
through cross-classifications, which Chinese radicals are lacking. However, it is 
also worth noting that some Chinese radicals are homographs. Thus, they can 
be grouped under two, supposedly disjoint, subdivisions of the same class. For 
instance, a (bean / pea, a container) is grouped under Container and Comestible, 
which are disjoint subdivisions of Function. Examples of characters under this 
radical are ^ (an ancient clay container) and SL (a kind of bean). 
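The cross-classification described above can be pictured as a mapping from a radical to a set of top concepts. The following is only a minimal sketch: radicals are keyed by their English glosses, the two entries are the examples given in the text, and the names `RADICAL_CONCEPTS`, `radicals_by_concept` and `cross_classified` are hypothetical helpers, not part of EWN.

```python
from collections import defaultdict

# Radicals keyed by English gloss; concept labels are the EWN TO top
# concepts named in the text. Only the two worked examples are included.
RADICAL_CONCEPTS = {
    "knife": {"Artifact", "Solid", "Instrument"},
    "bean / pea, a container": {"Container", "Comestible"},
}


def radicals_by_concept(mapping):
    """Invert radical -> concepts into concept -> sorted radical glosses."""
    inverted = defaultdict(list)
    for gloss, concepts in mapping.items():
        for concept in concepts:
            inverted[concept].append(gloss)
    return {concept: sorted(glosses) for concept, glosses in inverted.items()}


def cross_classified(mapping):
    """Glosses of radicals that are linked to more than one top concept."""
    return sorted(g for g, concepts in mapping.items() if len(concepts) > 1)


if __name__ == "__main__":
    print(cross_classified(RADICAL_CONCEPTS))
    print(radicals_by_concept(RADICAL_CONCEPTS))
```

A set per radical keeps the homograph case (the bean / container radical) and the ordinary cross-classified case (knife) in the same structure.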

4.1 EWN 1stOrderEntity 

• Origin 
o Natural 
P (corpse) Q (sun, day) ^ (moon, month) X (fire) 
* Living 
o Plant 
(wood, tree^) H (melon) ^ (grain on the stalk, grain) Y] (bamboo) (rice) 
«■ (grass) a (bean / pea, a container) (chives) ^ (wheat) jU (hemp / flax) 
$ (millet) 
o Human 
A (human being) IL (human being) J.' (scholar, gentleman) it (female, woman, 
girl) f (son, teacher / philosopher, a time unit) iS (not)^ (father) 
^ (old man, old) g> (minister / official) 
o Creature 
ifs (to show)^ ^ (ghost / evil spirit) 
o Animal 
(cattle) A (dog) ^ (sheep)^^ fit (tiger) rji (reptiles and insects) 
^ (boar / pig) ^ (legless insect or reptile, a mythical animal)^^ 
if (short tail birds) (horse) ^ (fish, to fish) (long tail birds) I® (deer) 
® (tadpole, a green frog) (rat / mouse) SB (dragon) ifc (turtle / tortoise) 
o Artifact 
/I (desk) LI (receptacle) J) (knife) t, (spoon, dagger) ti (a seal) rf (towel, 
wrapping) 'j (bow) A (halberd) p (door, house) ff (axe) ^ (bamboo sword / 
halberd^^) JE. (dish / bowl) /(• (spear) A (arrow) (thread) {{j (wide bottom 
water-jar) |S§ (a net) (plow) ^ (a writing brush, from) ft (mortar) [‘’J (door) 
(leather) fn (tripod kettle) Sf (embroidery) HE (three-legged kettle) (drum) 
fSf (a three- or six-hole flute) 

• Form 
o Substance 
* Solid 
i (ice, to freeze, icy / cold) I; (soil, country / region, residence) 
iJj (mountain) A (wood, tree^) _k (jade, gem stone) iQ (rock) n (grass) 
^ (metal, gold) ^ (soft leather, animal skin) gg (salt) 
* Liquid 
A (water) iJi (blood) 0 (a time unit)^^ pH (rain) 
* Gas 
H (air) a (wind) •§ (fragrance) 
o Object 
A (human being) JL (human being) f (corpse) fi (tile, pottery) $ (vehicle) 

• Composition 
o Part 
D (mouth) iL' (heart) T (hand) it (to stop, toes) t (hair) ni (claw) 
if (a canine tooth) jE (lower leg) ^ (skin) H (eye, eyeball) ^ (wings) 
if (ear) M3 (flesh) S' (tongue) ifli (blood) pj (horn) (leg / foot) M (body) 
ffi (face) If (head) I'l' (head) ^ (bone) u (long hair) (nose) (a tooth, the 
upper incisors) •j’ (sprout / bud) ^ (branch) ^ (wood, tree^) IJl (melon) 
(rice) (roof) (shelter, home / house) p (door, house) ft (tile, pottery)^^ 
fj (door) >1 (plank) 
o Group 
K (surname) g (city, village / rural community / town) 5? (town / small 
district / village) ^ (small village, hill / mound) 

^ Though A is originally a pictograph of a tree, nowadays it is very rarely, if at all, 
used to mean a tree. However, as a radical, A subsumes characters which denote 
kinds of plants as well as objects made of (or related to) wood. 

® # is in fact a pictograph of an adulterous woman behind a barred door [4]. The 
character (mother) is grouped under this radical. 

® A can also mean ‘the spirit of the earth’ and it subsumes a list of characters whose 
meanings relate to ‘god’, e.g. ft (god), ft (a pronoun for god), (prayer), (to 
fast), or ‘to show respect’, e.g. (polite). 

It is interesting to note that A also subsumes characters which mean ‘group’, e.g.: 
tf (group), (some nation). 

^ subsumes characters which mean ‘a beast with claws’, e.g. IS (cat), IS (mink), 
# (jackal). 

Although JL means a sword / halberd, the characters it subsumes tend to mean attack, 
kill or break, i.e. an event linked with the use of a sword / halberd. 

W is a pictograph of a wine vase [4]. Its ancient meaning, though no longer in use 
nowadays, is liquor. Thus, many characters which are grouped under 15 refer 
to a kind of alcohol (ig) or events (or the state of the participants in these events) 
related to alcohol. E.g.: S|^, ii, Sf, it, all refer to a kind of alcohol. 

• Function 
o Vehicle 
ISJ- (boat / ship) (vehicle) (horse)^^ 
o Representation 
flj (animal’s footprint)^^ 
* MoneyRepresentation 
ft (shell)^^ ^ (metal, gold) 
* LanguageRepresentation 
(literature, sentence) D (to say, to speak) (language / dialect, speech) 
fi (sound) 
* ImageRepresentation 
75 (square, a place)^^ 
o Software 
Though the concept of software exists in the Chinese language, e.g. #1 ^ 
(program), it is not of Chinese origin and it exists in the modern Chinese 
lexicon only. Thus, it is predictable that no radical corresponds to this 
concept. 
o Place 
(a hill side area, a cliff) | | (enclosure, country)^^ I: (soil, country / 
region, residence)^^ «< (river) EH (field, land) (hole, cave, pit / trap) 
(valley) 3L (town / small district / village) ^ (small village, hill / mound) 
o Occupation 
_L' (scholar, gentleman) X (craftsman / labourer, work) g (minister / official) 



Due to its second sense (i.e. pottery), fL refers to the objects which are made of 
baked clay. Thus, it subsumes a collection of characters whose meanings are related 
to ‘pottery’, e.g.: M (a water container), (china), ^ (a clay container). 

For a very long time, horses were the main means of transport in China. Thus, 
the horse radical also subsumes characters with meanings related to travelling with 
speed (e.g. ,|fc), to ride (e.g. .|f), or to drive (e.g. 1st). 

rt, as a verb, means the action of an animal rubbing its feet on the ground. Thus, as 
a noun, it means animal’s footprints. This radical is not common among characters, 
but its subsuming characters tend to mean ‘animal’ or ‘beast’. 

Shell was used as ‘money’ in ancient China. Thus, many of its subsuming characters 
represent the meanings which relate to the monetary system, e.g.: If (a collective 
term for valuables), 'M (money, capital), W (to buy), ^ (to sell). 

This radical has many meanings, but it occurs more often in the sense of a place. 
However, as a radical, the meanings of most of the characters it subsumes relate to 
‘flags’. E.g. ^ and 5% mean some sort of flags or banners. This is probably due to 
the fact that a flag or banner tends to be of a square-ish or rectangular shape. 

P is an ideograph representing an enclosure. It tends to refer to the area inside the 
enclosure. Thus, the characters under this group include H/H/lil (garden) and 0 
(country/kingdom). 

X subsumes characters which refer to a physical object related to soil, a physical 
location or a place, e.g.: ^ (living room), iis, (the boundary of a district), i# (wide 
area), M. (grave), 4# (town). 

Though, as a character, 子 can mean a teacher/philosopher (e.g. 孔子 (Confucius), 
where 孔 is his surname and 子 is a title), most of the characters grouped under 
this radical correspond to the meaning ‘son’. 
o Instrument 
JJ (knife) b (spoon, dagger) 'j (bow) (halberd) fr (axe) (bamboo sword / 
halberd^^) if’ (spear) ^ (arrow) m (a net)^^ (plow) (a writing brush, from)^^ 
(mortar) (vehicle) ® (drum) (a three- or six-hole flute) 
o Garment 
r|l (towel, wrapping) fft (thread) ^ (clothes, outer-garments) (soft leather, 
animal skin) $■ (leather) 
o Furniture 
n, (desk) 
As furniture tends to be made of wood, the characters for furniture tend to be 
grouped under (wood, tree^), e.g. (chair), ^ (table), M. (wardrobe). 
o Covering 
rh (towel, wrapping) S (skin) 1^ (a net) iB\ (cover, to cover) 
o Container 
LI (receptacle) L. (box / basket) C (box / chest) M (dish / bowl) {ij (wide 
bottom water-jar) Q (bean / pea, a container) m (tripod kettle) SJ| (three-legged 
kettle) 
o Comestible 
m (melon) ft (flesh) (rice) g (bean / pea, a container) (chives) -ft (to eat, 
food) ^ (wheat) $ (millet) 
o Building 
r~ (shelter, home / house) p (door, house) 

4.2 EWN 2ndOrderEntity 



• SituationType 
o Dynamic 
* BoundedEvent 
4 (to bear, to be born, to live) 
* UnboundedEvent 
X (fire) 
o Static 
r (to become ill, sickness) 
* Property 
A’ (dull black) S (white) (colour) iff (red) ^ (indigo) tS (yellow) H (black) 
* Relation 

• SituationComponent 
o Cause 
4 (to bear, to be born, to live) 
* Agentive 
* Phenomenal 
X (fire) 
* Stimulating 
o Communicating 
o Condition 
(to become ill, sickness) 
o Existence 
4 (to bear, to be born, to live) 
o Location 
o Manner 
o Mental 
A (dull black) 6 (white) fti (colour) ^ (red) (indigo) Sf (yellow) ^ (black) 
o Modal 
o Physical 
:K (fire) 4 (to bear, to be born, to live) (to become ill, sickness) 
o Possession 
o Purpose 
o Quantity 
o Social 
o Time 
(a time / grade unit) ^ (dusk) T (son, teacher / philosopher, a time unit) 
Li (a time unit) □ (sun, day) (moon, month) ^ (bitter, a time unit) 
M (a time unit) @ (a time unit) 
o Usage 

1*3 subsumes characters which mean some sort of nets or actions and manners relating 
to the use of a net. 

This radical means a brush for writing (i.e. pen), but the characters it subsumes mainly 
convey abstract senses, e.g. (majestic, strict, quiet), ^ (beginning, to begin). 



4.3 EWN 3rdOrderEntity 

It appears that no direct correspondence can be found between the concepts in 
the 3rdOrderEntity group and Chinese radicals. The TO concepts in this group 
are propositional in nature. It seems that Chinese radicals are not abstract 
enough to serve as the counterparts of concepts like plan, theory, information, 
etc. Thus, abstract concepts of this kind have to be constructed in a more 
complicated way in Chinese. For instance, the concept of idea can be denoted by 
M, which is grouped under the radical it' (heart); whereas the concepts of plan, 
theory and information can be represented by iff (plan), ^ (form, format) 
and ifl (information) respectively, in which radicals serve only as components of 
characters. Thus, in our opinion, the relations between the 3rd order entities and 
Chinese characters (other than radicals) should be examined in an independent 
study. 



5 Discussion 

The comparison of EWN TO structure and Chinese radicals as given in Section 
4 offers several interesting observations: 

1. Chinese radicals display a more anthropomorphic nature than EWN TO. This 
certainly follows from the fact that Chinese radicals came from a natural language 
in which some concepts (e.g. sun, moon, fire, water, ghost, etc.) are regarded as 
more important than others. 

2. Chinese radicals display more ambiguity than EWN TO items because they are 
among the fairly frequent items in the language. 

EWN TO has a (more or less) top-down and hierarchical logical structure. 
It classifies basic real world concepts into groups according to their natures, 
functions, forms, etc. Such a tightly woven structure of concepts does not readily, 
if at all, exist in the rather fragmented set of Chinese radicals, though it is 
possible to group some of them in a similar manner as the EWN TO. However, 
this does not imply that Chinese classifies concepts (represented by Chinese 
characters) in an arbitrary manner. Rather, concepts are organized in a different 
manner. 

Consider the radical @ (a time unit), whose ancient meaning was ‘liquor’. 
It subsumes a wide range of characters whose meanings correspond to different 
aspects of (and also objects within) events involving alcohol. For instance: 

• alcohol as an object: (alcohol), (a kind of liquor), ia (vinegar) 

• alcohol as a consequence: Wt (the face turning red after drinking), (drunk) 

• alcohol as the distinctive part of an event: Sf/®i (to pour alcohol on the floor 
during a sacrifice), ii (to get together for drinking), (to make alcohol), (to 
cure), SI (to wake up) 

• alcohol as a property: St. (sour), Sf (strong flavour) 

• alcohol’s component: S# (yeast) 

Similarly, A (human being) corresponds to concepts concerning human beings; 
whereas iL' (heart) and ' j (bow) correspond to concepts involving a heart (i.e. emo- 
tions and mental events) and a bow respectively. This seems to suggest that 
Chinese organizes concepts in a contextual manner and each Chinese radical 
represents the characterizing basic concept in each context. Thus, it is not sur- 
prising that when we take a group of Chinese characters under the same radical, 
we can obtain a mini semantic network which resembles the hierarchy of EWN 
TO. 
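To make the idea of a "mini semantic network" under one radical more concrete, the liquor example can be written down as a small grouping of glosses by aspect. This is only an illustrative sketch: the English glosses stand in for the characters, and the names `LIQUOR_RADICAL` and `aspects_of` are hypothetical.

```python
# Aspect -> English glosses of characters grouped under the 'liquor' radical,
# taken from the bulleted list in the text (glosses stand in for characters).
LIQUOR_RADICAL = {
    "object": ["alcohol", "a kind of liquor", "vinegar"],
    "consequence": ["the face turning red after drinking", "drunk"],
    "distinctive part of an event": [
        "to pour alcohol on the floor during a sacrifice",
        "to get together for drinking",
        "to make alcohol",
        "to cure",
        "to wake up",
    ],
    "property": ["sour", "strong flavour"],
    "component": ["yeast"],
}


def aspects_of(gloss, network):
    """Return the aspects under which a given gloss is listed."""
    return [aspect for aspect, glosses in network.items() if gloss in glosses]


if __name__ == "__main__":
    print(aspects_of("drunk", LIQUOR_RADICAL))
```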

The current version of EWN allows cross-referencing between two closely 
related concepts, e.g. death - died. However, it does not seem to allow the 
referencing of concepts that appear within a particular context. Since Chinese 
has been exploiting the idea of organizing concepts in a contextual manner for 
centuries, this suggests that context might also be an effective means to organize 
real world concepts. Indeed, as we have demonstrated in Section 3 (cf. the 
bird example), this organization helps a Chinese reader to acquire previously 
unknown concepts effectively. 

6 Conclusions 

The results of the comparison show that, within the collection of selected 
Chinese radicals, we can find elements which correspond reasonably to the 
elements of EWN TO. However, we have to keep in mind that EWN TO is an 
artificial construct whereas the radicals reflect a “semantic” situation in a 
natural language. Though Chinese radicals do not form a well-defined hierarchical 
system as TO does, in our comparison, we have shown that many of the 
important counterparts of TO entities can be found among them. In other words, 
we can conclude that, using them, we would be able to create a construction 
similar to EWN TO. In our view, this is a positive message since it tells us that 
we do not need to be so afraid of the arbitrariness which is an inevitable 
property of any ontology of the EWN TO type. 

Note that >§ is made up of a variant of ?!<. (water) and W. While W was an ancient 
pictograph of a wine vase, combining it with the concept of water (i.e. the most common 
kind of liquid) forms the concept of alcohol. In fact, many Chinese characters display 
this kind of interesting combination of meanings, e.g.: (to blow) and ik. (to cook), 
where A means ‘to blow’. 

Notice that the usage of the character for ‘drunk’ is not restricted to the sense of being 
drunk due to the intake of alcohol. It is also frequently used in metaphorical speech, e.g. 
a guy becomes ‘drunk’ due to a girl’s beauty (i.e. drunk emotionally). 

One might argue that similar meaning elements can in fact be found in 
any culturally developed natural language, so why should Chinese radicals be 
considered so special? In our opinion, their exclusiveness consists in the 
fact that they represent a natural language collection of basic meaning elements 
which does not exist in such a compact form in any other known natural language: 
there are only 214 radicals. 

The nature of the above comparison leads us to think that it would be useful: 

• to examine the collection of Chinese radicals and their structural and semantic 
relations more deeply, and 

• to apply the results of this exploration to the existing artificial TO 

with the hope that the existing TO can be appropriately modified to reasonably 
minimize their arbitrariness. 

Abstract. Most speech recognition systems use a language weight to reduce the 
mismatch between the language model and the acoustic models. Usually a constant 
value of the language weight is chosen for the whole test set. In this paper, we 
evaluate the possibility of adapting the language weight dynamically to the state of 
the dialogue or to the current utterance. Our experiments show that the gain in 
performance that can be achieved with a dynamic adjustment of the language 
weight on our data is very limited. This result is independent of the information 
source that is used for the adaptation of the language weight. 



1 Introduction 

1.1 Motivation 

In most current speech recognition systems, the decoding of a spoken utterance is done 
with the use of statistical models. This means that, given an acoustic input X, the recognizer 
performs a search for the best matching word chain w, which maximizes the a-posteriori 
probability $p(w \mid X)$. Because it is hard to model the corresponding density function 
directly, the value is usually determined using the Bayes formula. During the decoding 
phase of a speech recognizer, there is no interest in the real value of $p(w \mid X)$; only the 
best matching word chain w is needed. Since the denominator $p(X)$ of the Bayes formula 
does not depend on w, usually only the product $p(X \mid w) \cdot P(w)$ is evaluated for 
maximization. However, practical experience shows that there is 
a mismatch in the output scores of the acoustic models and the language model. Most 
speech recognizers introduce a language weight (also called linguistic weight) $\alpha$ to the 
score of the language model in order to make it fit better to the acoustic model. As a 
consequence, during the decoding phase of a speech recognizer, the algorithm searches 
for a word chain $\hat{w}$ with 

$$\hat{w} = \operatorname*{argmax}_{w} \; p(X \mid w) \cdot P(w)^{\alpha} \qquad (1)$$

Usually, the value of $\alpha$ is constant for the whole test set. However, we can expect 
that using only one global language weight for the combination of acoustic and language 
model scores in the recognizer may not be optimal and it may be better to adjust the weight 
dynamically. In this paper, we describe experiments that evaluate the influence of the 
language weight on the word error rate. We also investigate the possibility of choosing 
the language weight depending on individual sentences. Because the performance of the 
language model depends on the dialogue-state, we evaluate whether the optimal value of the 
language weight is correlated with the current dialogue-state. A more detailed description 
of the experiments can be found in [1]. 
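The effect of the language weight in Eq. (1) can be illustrated with a small sketch. It works in the log domain, where the weighted product becomes log p(X|w) + α · log P(w); this is a common implementation choice and an assumption here, not something stated in the paper. The two hypotheses and their probabilities are invented purely for illustration, and `weighted_score` and `decode` are hypothetical names.

```python
import math


def weighted_score(acoustic_logprob, lm_logprob, alpha):
    """Log of p(X|w) * P(w)**alpha, i.e. the quantity maximized in Eq. (1)."""
    return acoustic_logprob + alpha * lm_logprob


def decode(hypotheses, alpha):
    """Return the word chain with the highest weighted score.

    hypotheses: list of (words, acoustic_logprob, lm_logprob) tuples.
    """
    best = max(hypotheses, key=lambda h: weighted_score(h[1], h[2], alpha))
    return best[0]


# Invented toy scores: the first hypothesis is preferred by the language
# model, the second by the acoustic model.
hypotheses = [
    ("to munich", math.log(1e-5), math.log(1e-2)),
    ("two munich", math.log(4e-5), math.log(1e-3)),
]

if __name__ == "__main__":
    for alpha in (0.5, 1.0, 6.0):
        print(alpha, decode(hypotheses, alpha))
```

With these toy numbers a small weight lets the acoustic score dominate, while larger weights shift the decision towards the hypothesis favoured by the language model.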



V. Matousek et al. (Eds.): TSD 2001, LNAI 2166, pp. 323-328, 2001. 

1.2 Previous Work 

To our knowledge, only a few approaches concerning the adjustment of the language 
weight have been published so far. In [2] the scores of the acoustic and the language 
model are seen as opinions of experts. The scores are weighted with a reliability factor 
and linear interpolation is used to combine them. The work in [3] utilizes word dependent 
language weights which are determined heuristically in order to improve the language 
model. The Unified Stochastic Engine as described in [4] improves performance by word 
dependent language weights, which are estimated jointly with other model parameters. 

2 Data 

All evaluations were done on 20074 utterances, which have been recorded with the 
conversational train timetable information system Evar, as it is described in [5]. Nearly 
all utterances are in German. The total amount of data is ca. 23 hours, the 
average length of an utterance is four seconds. 15722 utterances have been selected 
randomly for training, 441 for the validation of the speech recognizer. The remaining 3911 
utterances are available for testing. All utterances have been labeled with the current 
dialogue-state. As in the work of Eckert et al. [6], the dialogue-states are defined by the 
question the user is replying to. Examples of the questions are what information do you 
need? or where do you want to go?. We only take the six most frequent questions into 
account, and map the less frequent questions to the more frequent ones. Experiments for 
the global optimization of the language weight are also compared to the results gained on 
a different dataset which is a subset of the data collected in the Verbmobil project [7]. 
The Verbmobil database contains spontaneous speech utterances which are related to 
appointment scheduling. The subset used for the experiments which are described in this 
paper contains ca. 46 hours of data and 21147 utterances. 15647 utterances have been 
selected randomly for training, 500 for the validation of the speech recognizer, 5000 are 
used for testing. 

The recording scenarios cause differences between the datasets in various aspects: 
The average utterance length of the Verbmobil data is 8 seconds (Evar: 4 seconds). 
The number of distinct words in Evar is 2158, in Verbmobil it is 8822. The perplexity 
of the corresponding 4-gram language model is much higher for the Verbmobil data 
(91.2) than for the Evar data (14.1). 

3 Short Description of the Baseline System 

The baseline system which has been used for the experiments is a speaker independent 
continuous speech recognizer. It is based on semi-continuous HMMs, the output densities 
of the HMMs are full-covariance Gaussian. The recognition process is done in two 
steps. First, a beam search is applied, which generates a word graph. The beam search 
uses a bigram language model. In the second phase, the best matching word chain is 
determined by an A*-search, which rescores the graph with a category based 4-gram 
language model. In both phases, a different language weight is used. For the experiments 
which are described below, only the language weight that is used during the A*-search 
is varied. A more detailed description of the speech recognizer can be found in [5]. 

4 Influence of the Language Weight on the Error Rate 

First, we attempt to get baseline results concerning the relationship between word 
error rate and the language weight. To this end, we varied the language weight over a large 
interval and computed the corresponding word error rate on the test set. As a result, we 
get the word error rate depending on the value of a global language weight. 
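The sweep over a global language weight can be sketched as follows. Note the simplification: the paper rescores a word graph with an A*-search, whereas this sketch rescoring a short n-best list per utterance is a swapped-in stand-in chosen to keep the example small. All scores and utterances are invented, and `word_errors`, `best_hypothesis`, `word_error_rate` and `sweep` are hypothetical names.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def best_hypothesis(nbest, alpha):
    """Pick the hypothesis maximizing acoustic + alpha * language model score."""
    return max(nbest, key=lambda h: h[1] + alpha * h[2])[0]


def word_error_rate(utterances, alpha):
    """WER over all utterances when one global weight alpha is used."""
    errors = words = 0
    for reference, nbest in utterances:
        hypothesis = best_hypothesis(nbest, alpha)
        errors += word_errors(reference.split(), hypothesis.split())
        words += len(reference.split())
    return errors / words


def sweep(utterances, alphas):
    """Map each candidate weight to the resulting word error rate."""
    return {alpha: word_error_rate(utterances, alpha) for alpha in alphas}


# Invented toy data: (reference, [(hypothesis, acoustic_logprob, lm_logprob)]).
utterances = [
    ("to munich", [("to munich", -11.5, -4.6), ("two munich", -10.1, -6.9)]),
    ("yes", [("yes", -3.0, -1.0), ("yes please", -2.5, -3.0)]),
]

if __name__ == "__main__":
    for alpha, wer in sweep(utterances, [0.0, 0.5, 1.0]).items():
        print(f"alpha={alpha}: WER={wer:.3f}")
```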



5 Dialogue-State Dependent Language Weight 

The next step is to perform a dialogue-state dependent optimization of the language 
weight. As mentioned before, the motivation for this approach is the observation that 
the performance of the language model depends on the dialogue-state [8]. We examine 
whether we can achieve a better adjustment between acoustic and language model by choosing 
a dialogue-state dependent value of the language weight. 
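A minimal sketch of this optimization, assuming that the number of word errors of each validation utterance has already been computed for every candidate weight: the utterances are grouped by dialogue-state and the weight with the fewest total errors is chosen per state. The record layout, the tie-breaking rule and the name `optimal_weight_per_state` are assumptions made for illustration; the numbers are invented.

```python
from collections import defaultdict


def optimal_weight_per_state(records, alphas):
    """Choose, for each dialogue-state, the weight with the fewest word errors.

    records: list of (dialogue_state, {alpha: n_word_errors}) pairs, one per
    validation utterance. Within one state the number of reference words is
    the same for every alpha, so minimizing errors also minimizes the WER.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for state, errors_by_alpha in records:
        for alpha in alphas:
            totals[state][alpha] += errors_by_alpha[alpha]

    best = {}
    for state, by_alpha in totals.items():
        # Ties are resolved towards the smaller weight (an arbitrary choice).
        best[state] = min(alphas, key=lambda a: (by_alpha[a], a))
    return best


# Invented toy error counts for three validation utterances.
records = [
    ("DS0", {4.0: 2, 6.0: 1, 8.0: 1}),
    ("DS0", {4.0: 0, 6.0: 0, 8.0: 2}),
    ("DS1", {4.0: 1, 6.0: 2, 8.0: 3}),
]

if __name__ == "__main__":
    print(optimal_weight_per_state(records, [4.0, 6.0, 8.0]))
```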



6 Utterance Dependent Language Weight 

A further refinement in the adjustment of the language weight is possible by adapting its 
value to the current utterance. However, it is not clear which information sources could 
be utilized in order to determine the optimal value during recognition. One possibility 
is to use confidence measures, which estimate the reliability of the language 
model or of the acoustic model. Other information sources could also be used, for 
example the current speaking rate. In this paper, we measure the maximum reduction in 
word error rate that could be achieved by such an approach. That means that we compute 
the optimal value of the language weight for each utterance in the test data individually. 
The word error rate on the test set is then evaluated with the optimal language weights. 
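The upper bound described here can be sketched as an "oracle" computation: for every test utterance the weight with the fewest word errors is picked, and the errors are summed. The sketch assumes the per-utterance error counts for each candidate weight are already available; the data layout, the invented numbers and the names `global_wer`, `oracle_wer` and `weight_independent` are illustrative assumptions.

```python
def global_wer(utterances, alpha):
    """WER when the same weight alpha is used for every utterance."""
    errors = sum(errs[alpha] for _n_words, errs in utterances)
    words = sum(n_words for n_words, _errs in utterances)
    return errors / words


def oracle_wer(utterances):
    """WER when each utterance gets its own best weight (an upper bound on the gain)."""
    errors = sum(min(errs.values()) for _n_words, errs in utterances)
    words = sum(n_words for n_words, _errs in utterances)
    return errors / words


def weight_independent(utterances):
    """Number of utterances whose error count is the same for every weight."""
    return sum(1 for _n_words, errs in utterances if len(set(errs.values())) == 1)


# Invented toy data: (number of reference words, {alpha: n_word_errors}).
utterances = [
    (4, {4.0: 1, 6.0: 0, 8.0: 0}),
    (2, {4.0: 0, 6.0: 1, 8.0: 1}),
    (4, {4.0: 2, 6.0: 2, 8.0: 2}),
]

if __name__ == "__main__":
    print({a: global_wer(utterances, a) for a in (4.0, 6.0, 8.0)})
    print(oracle_wer(utterances), weight_independent(utterances))
```

Because the oracle uses the reference transcription to pick the weight, its result cannot be reached by a real recognizer; it only bounds what any utterance dependent adaptation could gain.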



7 Experimental Results 

7.1 Influence of the Language Weight on the Error Rate 

For the global optimization of the language weight, four different validation sets are 
selected from the Evar training data. Each of the subsets contains 500 utterances. In 
Table 1 the word error rates which have been computed on the Evar test set are shown. 
Depending on the particular validation dataset, word error rates between 26.9% and 
26.1% have been achieved. The word error rate on the test set is 26.9% at the value of 
the language weight which is optimal on the validation data (6.0). In order to evaluate the 
maximum performance that could be reached, we also use the test set for optimization. 
This experiment results in a word error rate of 26.1%, the corresponding language weight 
is 8.0. 

The curve in Figure 1 depicts the progression of the word error rate depending on the 
language weight for the test sets of the Evar and Verbmobil databases. On the Evar 
test set, the word error rate is nearly constant in the range from 5 to 9. The curve for the 
Verbmobil data has a minimum at a language weight of 4; the word error rate rises 
slowly for larger weights. While both curves are relatively smooth, their most significant 
difference is the position of the global minimum, which seems to be data-dependent (see 
Section 2). 

Table 1. Global optimization of the language weight on the Evar dataset. For each validation set, 
the language weight and the word error rates on the validation and test set are shown. Please note that 
each validation set was excluded from the training data for the acoustic and language models, so 
the rows of the table correspond to recognizers which have been trained differently. 

validation set   optimal language weight   WER [%] on validation set   WER [%] on test set 
1                6.0                       20.5                        26.9 
2                4.8                       25.3                        26.5 
3                7.1                       24.0                        26.8 
4                7.5                       23.5                        26.1 
test set         8.0                       -                           26.1 

[Figure 1: two panels, "EVAR dataset" and "Verbmobil dataset"; horizontal axis: language weight] 

Fig. 1. Relationship between word error rate and the language weight for the test subsets of the 
Evar and the Verbmobil data. 



7.2 Dialogue-State Dependent Language Weight 

For dialogue-state dependent optimization of the language weight, we use three different 
validation sets, each containing 2000 utterances. The results can be seen in Table 2. When 
compared to the global optimization of the language weight, only slight improvements 
of the word error rate can be achieved. Depending on the validation set that has been used 
for the optimization, the word error rate on the test set lies between 27.2% and 26.2%. 
The optimal values of the language weights have a very high variance. In some cases, 
the variance for the same dialogue-state across the validation data sets is even higher 
than the variance across different dialogue-states in the same validation data set. The 
high variance may be caused by an insufficient size of the data sets. Another explanation 
is that there is no or only little correlation between the optimal value of the language 
weight and the current dialogue-state. 

Table 2. Dialogue-state dependent optimization of the language weight. For each dialogue-state 
(DS0, ..., DS5) the language weight is chosen separately. The optimal weights and the word error 
rates for the test set are shown. 

validation set   opt. language weights                    WER [%] 
                 DS0   DS1   DS2   DS3   DS4   DS5 
1                6.8   7.8   7.1   5.7   5.8   6.1      26.2 
2                8.6   5.1   7.3   4.9   7.1   8.4      27.2 
3                7.5   6.1   6.2   5.5   7.9   6.1      26.4 
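The remark about the variances can be checked directly against the weights in Table 2. The sketch below uses the population variance from Python's standard library; which variance estimator the authors had in mind is not stated, so this choice is an assumption, and the variable names are hypothetical.

```python
from statistics import pvariance

# Optimal language weights from Table 2: validation set -> [DS0, ..., DS5].
OPTIMAL_WEIGHTS = {
    1: [6.8, 7.8, 7.1, 5.7, 5.8, 6.1],
    2: [8.6, 5.1, 7.3, 4.9, 7.1, 8.4],
    3: [7.5, 6.1, 6.2, 5.5, 7.9, 6.1],
}

# Variance across dialogue-states within one validation set.
within_set = {
    set_id: pvariance(weights) for set_id, weights in OPTIMAL_WEIGHTS.items()
}

# Variance of one dialogue-state across the three validation sets.
across_sets = {
    f"DS{i}": pvariance(column)
    for i, column in enumerate(zip(*OPTIMAL_WEIGHTS.values()))
}

if __name__ == "__main__":
    print("within validation set:", within_set)
    print("across validation sets:", across_sets)
```

For example, the DS1 weights (7.8, 5.1, 6.1) vary more across the validation sets than the six weights of validation set 1 vary among themselves.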

7.3 Utterance Dependent Language Weight 

Because we want to measure the maximum possible reduction in word error rate, we do 
not use a separate validation set for this experiment. The language weight is optimized 
for each utterance of the test set. The approach results in a word error rate of 25.4% 
on the test set, which corresponds to a reduction of 0.7 percentage points or 2.8% relative. 
This slight improvement is the best possible reduction of the word error rate that can be 
achieved with an utterance dependent language weight. For this result, it is irrelevant 
which information source or confidence measure is used for the adaptation. In 
order to understand the results better, we analyzed the change in the word error rate on 
individual sentences depending on the language weight. We discovered that the error 
rate on 1046 (27%) of the sentences in the Evar test set is completely independent of 
the language weight. In 2991 sentences (77%), the error rate does not change when the 
language weight is varied in a range from 1 to 9. Utterances which have an error rate 
that does not depend on the language weight are significantly shorter than the average. 



8 Conclusion and Outlook 

In most speech recognition systems, the score of the language model is manipulated with 
a language weight. Usually a constant value of the language weight is chosen for the 
whole test set. In this paper, we evaluated the possibility of adapting the language weight 
dynamically to the state of the dialogue or to the current utterance. Our experiments 
show that a dynamic adjustment of the language weight does not necessarily result in 
better performance. This may be caused by the large proportion of relatively short utterances 
in our data, which is typical for dialogues in an information retrieval system. 

Further experiments will investigate the Verbmobil data set, which contains 
longer utterances. We also plan to apply alternative methods (e.g. [9]) 
which are not necessarily based on a weighting factor to reduce the mismatch between 
acoustic model and language model. 

Abstract. Most feature extraction techniques involve in their primary stage a 
Discrete Fourier Transform (DFT) of consecutive, short, overlapping windows. The 
spectral resolution of the DFT representation is uniform and is given by 
$\Delta f = 2\pi/N$, where N is the length of the window. The present paper investigates the 
use of non-uniform rate frequency sampling, varying as a function of the spectral 
characteristics of each frame, in the context of Automatic Speech Recognition. 
We are motivated by the non-uniform spectral sensitivity of human hearing and 
the necessity for a feature extraction technique that auto-focuses on the most reliable 
parts of the spectrum in noisy cases. 



1 Introduction 

Contemporary ASR systems are composed of a feature preprocessing stage, which aims 
at extracting the linguistic message while suppressing non-linguistic sources of variabil- 
ity, and a classification stage (including language modelling), that identifies the feature 
vectors with linguistic classes. The extraction level of current ASR systems converts the 
input speech signal into a series of low-dimensional vectors, each vector summarizing the 
necessary temporal and spectral behaviour of a short segment of the acoustical speech 
input. The ultimate goal is to estimate the sufficient statistics to discriminate among 
different phonetic units while minimizing the computational demands of the classifier. 

Based on the fact that oversampling cannot increase spectral resolution, but merely
has an interpolating effect, our technique oversamples each window of N speech samples
with a 2N-point DFT and subsequently reduces the transformed 2N samples to N samples
on average, de-emphasizing the selection of spectral coefficients from spectral valleys
in favour of samples around a spectral transition.
Our approach attempts to establish a link between non-uniform frequency sampling
and the non-uniform spectral sensitivity of human hearing, which is most sensitive in
the 1-3 kHz range and therefore emphasizes the second and third formant region [1]. We
are encouraged by the fact that, from an informational point of view, transition is more
fruitful than repetition. Furthermore, in noisy conditions, the high-energy parts of the
signal (e.g. formants) are less disrupted by noise and are therefore more reliable
for recognition [2].



2 Description of the Algorithm

Let $x(m)$ be a finite duration time domain signal with a length of N samples, corresponding
to a pre-emphasized window. The DFT provides a mapping between the N uniformly
spaced samples of the time sequence and a discrete set of N uniformly spaced frequency
domain samples

$$X(k) = \sum_{m=0}^{N-1} x(m)\, e^{-j 2\pi k m / N} \qquad (1)$$

The length N of the window defines the uniform spectral resolution of the Fourier
representation, which is $\Delta f = 2\pi/N$. We expand $x(m)$ from N to 2N samples by
padding N zeros, thus obtaining the sequence $[x(0), \ldots, x(N-1), 0, \ldots, 0]$. The
DFT of the padded signal is given by:

$$X(k) = \sum_{m=0}^{2N-1} x(m)\, e^{-j 2\pi k m / (2N)} \qquad (2)$$

Although zero padding cannot increase spectral resolution, it introduces an interpolating
(smoothing) effect in the frequency domain [3]. The latter becomes obvious if we
consider that the spectral samples $\{X(0), X(2), X(4), \ldots, X(2N-2)\}$ are the same
as those that would be obtained from a DFT of the original N time domain samples
(see Eq. 2). The remaining N samples $\{X(1), X(3), X(5), \ldots, X(2N-1)\}$ are the
interpolated spectral samples that result from the zero padding [4].
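
The relation between the even-indexed samples of the padded transform and the unpadded transform can be checked numerically. The following is a minimal sketch using a direct (slow) DFT written for clarity; the function names are illustrative and not part of the original work.

```python
import cmath


def dft(x, n_points):
    """Direct n_points-point DFT of x, zero-padded to n_points samples."""
    padded = list(x) + [0.0] * (n_points - len(x))
    return [
        sum(padded[m] * cmath.exp(-2j * cmath.pi * k * m / n_points)
            for m in range(n_points))
        for k in range(n_points)
    ]


def even_samples_match(x, tol=1e-9):
    """Check that the 2N-point sample X(2k) equals the N-point sample X(k)."""
    n = len(x)
    x_n = dft(x, n)
    x_2n = dft(x, 2 * n)
    return all(abs(x_2n[2 * k] - x_n[k]) < tol for k in range(n))
```

For any frame, `even_samples_match(frame)` should hold up to rounding error, which is the sense in which zero padding only interpolates between the original spectral samples.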

Let $E(k) = |X(k)|$ denote the amplitude of each complex DFT value, where
$k = 1, \ldots, 2N$.

Let $T(k)$ be a local threshold given by the median of the 50 previous and 50 following
amplitude values of sample $X(k)$; that is

$$T(k) = \mathrm{median}\big(E(k-50), E(k-49), \ldots, E(k), \ldots, E(k+49), E(k+50)\big).$$

Regarding the first and last 49 spectral amplitudes within a frame, the first and last local 
threshold is repeated. 

Let denote the spectral slope. Since the samples are uniformly partitioned in 

the spectral domain, the slope is reduced to a simple distance among adjacent spectral 
amplitudes. The acceptance or rejection rule of each spectral value of the over-sampled 
window is defined as: 



{Ak = Ak-I + - ni)l > 7 * 10-^T(A:)) (3) 

If the condition of Eq. 3 holds, accept the spectral value $X(k)$, reset the accumulator
($A_0 = 0$) and continue from sample $X(k+1)$ to the next sample that forces the accumulator
to exceed the threshold value. The process is repeated until the end of the frame.
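
The selection procedure can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that the accumulator adds the absolute difference between adjacent amplitudes and that a sample is accepted when the accumulator exceeds `c * T(k)`. The scaling constant `c` and the window half-width are left as parameters, and all function names are hypothetical.

```python
import cmath
from statistics import median


def amplitude_spectrum(frame):
    """Amplitudes |X(k)| of the 2N-point DFT of a zero-padded N-sample frame."""
    n2 = 2 * len(frame)
    padded = list(frame) + [0.0] * len(frame)
    return [
        abs(sum(padded[m] * cmath.exp(-2j * cmath.pi * k * m / n2)
                for m in range(n2)))
        for k in range(n2)
    ]


def local_thresholds(amps, half_width=50):
    """Median of the amplitudes in a window of +/- half_width around each k.

    Near the frame edges the first/last full-window threshold is repeated.
    """
    n = len(amps)
    if n < 2 * half_width + 1:          # frame too short for a full window
        return [median(amps)] * n
    lo, hi = half_width, n - half_width - 1
    full = {k: median(amps[k - half_width:k + half_width + 1])
            for k in range(lo, hi + 1)}
    return [full[min(max(k, lo), hi)] for k in range(n)]


def select_samples(amps, c, half_width=50):
    """Indices k at which the accumulated slope exceeds c * T(k)."""
    thresholds = local_thresholds(amps, half_width)
    selected, acc = [], 0.0
    for k in range(1, len(amps)):
        acc += abs(amps[k] - amps[k - 1])   # slope between adjacent samples
        if acc > c * thresholds[k]:
            selected.append(k)
            acc = 0.0                        # reset the accumulator
    return selected
```

On a spectrum with a flat region followed by a rising edge, the sketch selects no samples in the flat part and every sample on the edge, which is the behaviour described for transitions versus steady regions.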

Slope weighting favours adjacent spectral values within the same frame that differ 
significantly and de-emphasizes spectral valleys. Therefore, spectral samples around 
a spectral transition within the frame are sampled with high density, while samples from 
steady parts are sparsely selected. Slope weighting is combined with energy weighting in
order to de-emphasize the contribution of slope values corresponding to noise. Figure 1
demonstrates that in noisy situations the algorithm emphasizes the selection of spectrum 
samples that rise above the noise level. 





Sample points retained/rejected: black lines indicate samples retained.



Fig. 1. Word Recognition Accuracy (%). Cepstral coefficients are Mean Normalized. 



Let the selected spectral coefficients of frame (p) be

$e^{(p)} = [E(0), \ldots, E(M-1)], \quad M \in [1, \ldots, 2N-1]$.

The value 7 * 10“® in the equation of the accumulator is empirically chosen in 
order to achieve an average frame size of N samples. Each $e^{(p)}$ is filtered by a set of
20 Mel-spaced, triangular band-pass filter-bank channels and, since the size of $e^{(p)}$ can
vary from frame to frame, the filter-bank weighting factors are evaluated at
the corresponding frequencies of the selected samples. Let the energy output of
log-filterbank (n) regarding frame (p) be

* 6 (4) 

Applying a Discrete Cosine Transform on n = 1, . . . , 20 results to the Mel- 
Frequency Cepstral Coefficients (MFCCs) defined as: 

4^^ = cos 

n— 1 

Evaluation of Eq. 4 for q = 1, . . . , 12 leads to 12-dimensional MFCCs, to which we add
a log-energy value for each speech frame, resulting in a 13-dimensional feature vector.
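
The last step can be sketched in a few lines. The sketch assumes the standard unnormalized DCT-II commonly used in MFCC front-ends; the exact scaling of the original equation is not reproduced here, and the function names are illustrative.

```python
import math


def mfcc_from_log_energies(log_energies, num_ceps=12):
    """DCT-II of log filter-bank energies, returning coefficients 1..num_ceps."""
    n_chan = len(log_energies)
    return [
        sum(e * math.cos(math.pi * q * (n - 0.5) / n_chan)
            for n, e in enumerate(log_energies, start=1))
        for q in range(1, num_ceps + 1)
    ]


def feature_vector(log_energies, frame_log_energy):
    """12 cepstral coefficients plus one log-energy value (13 dimensions)."""
    return mfcc_from_log_energies(log_energies) + [frame_log_energy]
```

With 20 log filter-bank outputs per frame, `feature_vector` returns the 13 values described above. A constant filter-bank output yields cepstral coefficients that are zero up to rounding, since all of its energy sits in the omitted zeroth coefficient.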







I. Potamitis, N. Fakotakis, and G. Kokkinakis 



3 Simulation and Results 

Word Recognition Accuracy was assessed by using a speech recognition module built
with the HTK Hidden Markov Model toolkit, using part of the identity card corpus of the
SpeechDat database. The training set consisted of 1000 speech files and the testing set
of 200 files. The basic recognition units are tied-state, context-dependent triphones of five
states each. Given this set of HMMs and the corresponding dictionary, the HTK recognition
unit produces the best path of the word network using the Frame Synchronous
Viterbi Beam Search algorithm. After applying the variable frequency sampling procedure,
a bank of 20 Mel-spaced triangular band-pass filters is applied to the spectrum.
Thirteen-dimensional feature vectors are formed by applying the DCT to the log-filter-bank
outputs, which reduces the 20 output channels to 12-dimensional MFCC features plus
a log-energy value. Cepstral mean normalization was applied to deal with the constant
channel assumption. To assess the effectiveness of the variable frequency sampling technique,
Deltas and double-Deltas were not concatenated to the final observation vector,
since for the comparison of the new feature set we were not interested in absolute performance.
The results of the tests conducted are given in Table 1 and demonstrate
that our enhancement method cooperates well with the HMM framework. Table 1 compares
the recognition results achieved with standard MFCCs and with MFCCs with a variable
frequency sampling stage interposed. The results demonstrate consistent superiority of
non-linear frequency sampling over uniform DFT sampling in both clean and noisy
conditions. However, we should stress that variable rate frequency sampling
is not a denoising technique.



Table 1. Word Recognition Accuracy (%).



4 Conclusions

In this work, a novel data-driven feature extraction technique is presented. Our approach
links the concept of spectral sampling in the context of ASR with the spectral characteristics
of each frame. This leads to a non-uniform sampling algorithm which, compared to
established MFCCs, increases the performance in clean conditions while improving the
noise robustness. We wish to emphasize the practical advantage of our method, which
does not seek to revise but, rather, furthers the discriminative ability of the already
existing and successful MFCC front-end.



Abstract. The Phonectic SMS Reader is a speech technology test application
that enables sending SMS messages and receiving them in voice format. The 
application can be viewed as a new value-added service for a mobile telephony 
operator. 

The test system described includes two mobile telephones, one for intercepting 
SMS messages and another one for calling the message receivers. The text message 
is converted to speech by a text-to-speech system. 



1 Introduction 

The Phonectic SMS Reader is a speech technology test application that enables sending 
SMS messages and receiving them in voice format. The application can be viewed as 
a new value-added service for a mobile telephony operator. 

The SMS messages have to include the telephone number of the addressee as well 
as the message to be read. The test system described includes two mobile telephones, 
one for intercepting SMS messages and another one for calling the message recipients. 
The text message is converted to speech by the Phonectic text-to-speech system based 
on the S5 system [1]. 

2 The Phonectic SMS Reader 

The Phonectic SMS Reader test application consists of three main modules, as shown in
Fig. 1. The first module intercepts the incoming SMS message and parses it.
The following information is sent to the next modules: message ID,
message text, sender phone number, addressee phone number, time stamp and ID (file
with complete path).

The text part of the message is sent to the second module, the voice server, that per- 
forms the text-to-speech conversion. The final message transmission module composes 
the final message, e.g. ‘You have received a message from the number 031765249. The 
message is as follows: Speech synthesis test. End of message’. Finally, the message is 
sent to the addressee. 
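
The parsing and message composition steps can be illustrated with a minimal sketch. It assumes that the addressee number is the first whitespace-delimited token of the SMS body, as in the sample message of Fig. 1; the function names are hypothetical and the real modules also handle message IDs, time stamps and files, which are omitted here.

```python
def parse_sms(body):
    """Split an SMS body into (addressee_number, message_text)."""
    number, _, text = body.strip().partition(" ")
    if not number.isdigit() or not text.strip():
        raise ValueError("expected '<addressee number> <message text>'")
    return number, text.strip()


def compose_announcement(sender_number, text):
    """Wrap the message text in the spoken framing quoted in the text above."""
    return (f"You have received a message from the number {sender_number}. "
            f"The message is as follows: {text}. End of message")
```

For example, `parse_sms("041765249 Speech synthesis test")` yields the addressee number and the text to be synthesized, and `compose_announcement` produces the string that would be passed to the voice server.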

In the following sections the Phonectic TTS system and its principles are described.

Fig. 1. The architecture of the Phonectic SMS Reader. A sample message sent to the system is
‘041765249 Speech synthesis test’, meaning the message ‘Speech synthesis test’ will be sent to
the number 041765249.



3 The Phonectic TTS System 



Text-to-speech synthesis (TTS) enables automatic conversion of any available textual 
information into its spoken form. 










J. Gros et al. 



The input text is transformed into its spoken equivalent by a series of modules, which 
we describe in detail. A grapheme-to-phoneme or -to-allophone module produces strings 
of phonetic symbols based on information in the written text. The problems it addresses 
are thus typically language-dependent. A prosodic generator assigns pitch and duration 
values to individual phones. Final speech synthesis is based on diphone concatenation 
using TD-PSOLA [2]. 

3.1 Grapheme-to-Allophone Transcription 

Input to the S5 system is free text. For the time being, input text should be stored in 
ASCII format; currently we are expanding the input possibilities so that in future it may 
come from other programs or marked regions on the computer screen. 

Input text is translated into a series of allophones in two consecutive steps. First, 
input text normalization is performed. Abbreviations are expanded to form equivalent 
full words using a special list of lexical entries. The text normalizer converts further 
special formats, like numbers or dates, into standard grapheme strings. The rest of the 
text is segmented into individual words and basic punctuation marks. 

Next, word pronunciation is derived, based on a user-extensible pronunciation
dictionary and letter-to-sound rules. The dictionary covers over 16,000 of the most frequent
inflected word forms.

In cases where dictionary derivation fails, words are transcribed using automatic
lexical stress assignment and letter-to-sound rules. However, as lexical stress in Slovene
can be located almost arbitrarily on any syllable, this step can introduce errors into the
pronunciation of words.

Automatic stress assignment is to a large extent determined by (un)stressable affixes,
prefixes and suffixes of morphs, based upon observations of linguists [3].

For words which do not belong to these categories, the most probably stressed syllable
is predicted using the results of a statistical analysis of stress position depending
on the number of syllables within a word.

Finally, a set of over 150 context-dependent letter-to-sound rules translates each word
into a series of allophones.
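
The order of operations (normalization, dictionary lookup, rule-based fallback) can be sketched as below. All data are toy placeholders: the dictionary entry, the abbreviation, the phone symbols and the context-free letter map are invented for illustration and do not reflect the S5 lexicon or its context-dependent rules. Stress assignment is omitted.

```python
# Toy placeholder data, not taken from the S5 system.
PRONUNCIATIONS = {"test": ["t", "E", "s", "t"]}
ABBREVIATIONS = {"dr.": "doktor"}
LETTER_RULES = {"c": "ts", "š": "S", "ž": "Z", "č": "tS"}


def normalize(text):
    """Expand known abbreviations and split the text into lowercase words."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)
        words.append(token.strip(".,!?;:"))
    return [w for w in words if w]


def transcribe_word(word):
    """Dictionary lookup first; fall back to letter-to-sound rules."""
    if word in PRONUNCIATIONS:
        return PRONUNCIATIONS[word]
    return [LETTER_RULES.get(letter, letter) for letter in word]


def transcribe(text):
    """Transcribe every word of the normalized text."""
    return [transcribe_word(w) for w in normalize(text)]
```

Keeping the dictionary lookup ahead of the rules means that user-added entries always override the rule-based transcription, which is one plausible motivation for a user-extensible dictionary.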

3.2 Prosody Generation 

A number of studies suggest that prosody has great impact on the intelligibility and 
naturalness of speech perception. Only the proper choice of prosodic parameters, given 
by sound duration and intonation contours, enables the production of natural-sounding 
high quality synthetic speech. Prosody generation in S5 consists of four phases: 

- intrinsic duration assignment, 

- extrinsic duration assignment, 

- modelling of the intra-word F0 contour and

- assignment of a global intonation contour. 

The first and the third phase are sometimes referred to as microprosodic parameter
assignment, since they are performed on speech units smaller than a word. The second
and the fourth phase are also called macroprosodic parameter determination, since they
operate above the word level.

A speech database consisting of isolated words, carefully chosen by phoneticians
[4], was recorded in order to study different effects on phone duration and fundamental
frequency which operate at the segmental level. Vowel duration and F0 were studied in
different types of syllables: stressed/unstressed, open/closed. Consonant duration was
measured in CC and VCV clusters [5].

Another large continuous speech database was recorded to study the impact of speaking
rate on syllable duration and duration of phones [6]. A male speaker was instructed
to pronounce the same material at different speaking rates: at a normal, fast and slow
rate. Thus context, stress and all other factors were kept identical in every realisation of
the sentence. As a result, pair-wise comparisons of phone duration could be made. The
effect of speaking rate on phone duration was studied in a number of ways. An extensive
statistical analysis of lengthening and shortening of individual phones, phone groups
and phone components, like closures or bursts, was performed, the first of its kind for
the Slovenian language [6].

Pair-wise comparisons of phone duration were calculated. Average mean duration 
differences and standard deviations were computed for pairs of phones pronounced at 
different speaking rates. Prior to the comparison, phone duration was normalized to the 
corresponding normal phone duration. Pairs were first composed of normal and slow 
rate phones, and later of fast and normal rate phones. 

The closures of plosives change only slightly and maintain almost the same duration
regardless of the speaking rate. Short vowels, contrary to long vowels, lengthen more
at a slow speaking rate than they shorten at a fast one. From
these observations we may draw a conclusion: phones or phone components which
are considered short by nature, except for plosive bursts, increase more in length at
a slow rate than they shorten at a fast rate. The opposite holds for affricates and long
vowels. Articulation rate, expressed as the number of syllables or phones per second,
excluding silences and filled pauses [7], was studied for the different speaking rates. In
other studies, articulation rate is usually determined for speech units with the length of
individual words or entire phrases. We studied the articulation rate of words along with
their associated cliticised words at different positions within a phrase: isolated, phrase
initial, phrase final and nested within the phrase.

The articulation rate increases with longer words, as average syllable duration tends
to decrease with more syllables in a word. The articulation rate immediately after pauses
is higher than the one prior to pauses. A set of measurements was made in order to
define four typical intonation contours based on four Slovenian basic intonation types
[8]. Read newspaper articles were processed by an AMDF pitch extractor. Then, a manual
piecewise linearization of F0 curves into pitch contours was performed. Our interest was
to detect typical prosodic segments by means of F0 contours.

Duration Modelling. Regardless of whether the duration units are words, syllables 
or phonetic segments, contextual effects on duration are complex and involve multiple 
factors. Similarly to [9], our two-level duration model first determines the words’ intrinsic 
duration, taking into account factors relating to the phone segmental duration, such 
as: segmental identity, phone context, syllabic stress and syllable type: open or closed
syllable. Further, the extrinsic duration of a word is predicted, according to higher-level
rhythmic and structural constraints of a phrase, operating on a syllable level and above. 
Here the following factors are considered: the chosen speaking rate, the number of 
syllables within a word and the word’s position within a phrase, which can be isolated, 
phrase initial, phrase final or nested within the phrase. 

Finally, intrinsic segment duration is modified, so that the entire word acquires its 
predetermined extrinsic duration. It is to be noted that stretching and squeezing does not 
apply to all segments equally. Stop consonants, for example, are much less subject to 
temporal modification than other types of segments, such as vowels or fricatives. 

Therefore, a method for segment duration prediction was developed, which adapts 
a word with an intrinsic duration ti to the determined extrinsic duration te, taking into 
account how stretching and squeezing apply to the duration of individual segments [6]. 
The reliability of our two-level prediction method was evaluated on a speech corpus 
consisting of over 150 sentences. The predicted durations were compared to those in the 
same position in natural speech. Natural duration variation was evaluated by averaging 
the duration differences for words, which occurred in the corpus several times, in the 
same phonetic environment and in the same type of phrase. 

The standard deviation of the difference between natural and predicted duration
is 15.4 ms for the normal speaking rate, and even less for stressed phonemes, the
duration of which is of crucial importance to the perceived naturalness of synthetic
speech.
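
The idea of fitting intrinsic segment durations to a predetermined word duration, with some segments resisting change more than others, can be illustrated as follows. This is not the method of [6], whose details are not given here; it is one simple proportional scheme, and the per-segment elasticity values are hypothetical.

```python
def adapt_durations(intrinsic_ms, elasticity, extrinsic_ms):
    """Adjust segment durations so that they sum to the extrinsic duration.

    Each segment absorbs a share of the total change proportional to
    duration * elasticity, so inelastic segments (e.g. stops) change less.
    """
    total = sum(intrinsic_ms)
    weights = [d * e for d, e in zip(intrinsic_ms, elasticity)]
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("at least one segment must be elastic")
    delta = extrinsic_ms - total
    return [d + delta * w / weight_sum for d, w in zip(intrinsic_ms, weights)]
```

For a word with a stop, a vowel and a fricative of 60, 100 and 80 ms and assumed elasticities 0.2, 1.0 and 0.8, stretching to 300 ms lengthens the stop by about 4 ms and the vowel by about 34 ms.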

Intonation Modelling. Since the Slovenian language has been defined as a pitch accent
language [4], special attention was paid to the prediction of tonemic accents for individual
words. First, initial vowel fundamental frequencies were determined according
to previous measurements as suggested by [4], creating the F0 backbone. Each stressed
word was assigned one of the two tonemic accents characteristic of the Slovenian language.
The acute accent is mostly realized by a rise on the posttonic syllable, while with
the circumflex the tonal peak usually occurs within the tonic. Five typical F0 patterns
were chosen from the variety of F0 patterns described in [4]. Finally, a linear interpolation
between the defined F0 values was performed.
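
The final interpolation step can be sketched as below. The target points, the 10 ms step and the function name are assumptions made for illustration; the actual F0 patterns are those described in [4] and are not reproduced.

```python
def interpolate_f0(targets, step_ms=10):
    """Piecewise-linear F0 contour through (time_ms, f0_hz) target points."""
    targets = sorted(targets)
    contour = []
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        t = t0
        while t < t1:
            # Linear interpolation between the two neighbouring targets.
            contour.append((t, f0 + (f1 - f0) * (t - t0) / (t1 - t0)))
            t += step_ms
    contour.append(targets[-1])
    return contour
```

Given targets at 0, 20 and 40 ms, the function returns one F0 value every 10 ms, rising to the peak and falling again, which is the kind of contour a pitch-synchronous synthesizer can then follow.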

We used a relatively simple approach for prosody parsing and the automatic prediction
of Slovenian intonational prosody, which makes no use of syntactic or semantic
processing [10], but rather uses punctuation marks and searches for grammatical words,
mainly conjunctions which introduce pauses. We considered it more important to predict
the word F0 contour modeling the tonemic accent as reliably as possible than to explore
sentence intonation.

The drawbacks of such a syntactically independent prosodic parser are important, as 
in many cases prosodic parameters are determined by the syntactic structure of a phrase 
and cannot be reliably estimated without a deep syntactic or even semantic analysis. 

3.3 Speech Segment Concatenation 

Once appropriate phonetic symbols and prosody markers are determined, the final step
within S5 is to produce audible speech by assembling elemental speech units. This is
achieved by taking into account computed pitch and duration contours, and synthesizing
a speech waveform.

A concatenative synthesis technique was used. The TD-PSOLA scheme enables
pitch and duration transformations directly on the waveform, at least for moderate ranges
of prosodic modifications [2], without considerably affecting the quality of the synthesized
speech.

Diphones were chosen for concatenative speech units as a compromise between the 
size of the unit inventory, the complexity of the concatenation rules and the resulting 
speech quality. A new speech corpus including the most frequent Slovene polyphones 
is being constructed to be included in Phonectic TTS Ver. 3.0. 

4 Testing and Evaluation 

The Phonectic SMS Reader test application was tested for one month by a selected
group of users. The users reported great affinity to the application. Some considered
the application to be a promising new service enabling sending SMS messages into
the traditional PSTN network; others saw it as a fun application for sending humorous
messages to friends. All agreed that they liked the application and saw great potential in it.

5 Conclusion 

The paper describes a speech technology test application, the Phonectic SMS Reader. The 
application enables sending a new kind of SMS messages through the mobile telephony 
network as well as into the PSTN network. The addressee receives the SMS message in 
form of a synthetically read message. 

A short testing period revealed that the application received a very positive response
from the test users, who saw it either as a nice toy or as a means of sending SMS messages
to a traditional PSTN phone user. We therefore conclude that the Phonectic SMS Reader shows
great potential to be implemented as a new value-added service in a mobile telephony
network.



Abstract. In this paper we take a second look at current research issues for conversational
dialogue systems addressed in [17]. We look at two systems, a movie
information and a stock information system, which were built based on the experiences
with the train information system Evar, described in [17].



1 Introduction 

Two years ago, at TSD ’99 we presented the Evar system, a conversational spoken 
dialogue research prototype for train information [17]. There we discussed research 
issues for conversational dialogue systems. At the same time first plans were made to
start a spin-off company out of our research group with the aim to market dialogue 
systems. Based on the philosophy presented in [17], a completely new implementation 
of a task- and language-independent dialogue manager was performed. Since then we 
implemented about 10 different conversational systems with applications ranging from 
movie information and movie ticket reservation to quality assurance in an industrial 
environment. Systems were presented at Systems ’99, CeBit 2000, Systems 2000 and 
CeBit 2001. In October 2000 Sympalog received the IST-Prize 2001 by the European 
Community for its conversational dialogue technology. 

We thought this conference would be a good occasion to have a second look at
some of the implemented systems to see where research issues raised in [17] were
addressed and which new issues came up. Of course, we do not claim that we solved the
addressed research issues. Also, even though we constantly tested the different prototypes
of the systems with ‘naive’ users, we cannot yet provide results from an extensive and
systematic field test.

We first characterize two of the implemented systems, the movie information system
Fraenki and the stock information system Stocki (Section 2), then we look at most
of the research issues addressed in [17] (Sections 3-9) and address some additional
issues (Sections 10-11).

* This research was funded by the German Federal Ministry of Education, Science, Research and 
Technology (BMBF) in the framework of the Verbmobil Project under Grant 01 IV 701 K5. 
The responsibility for the contents of this article lies with the authors.



2 Two Exemplary Conversational Dialogue Systems

2.1 Fraenki 

After the implementation of a completely new task- and language-independent dialogue
manager in the application domain of Evar (train information), we created a dialogue
system for a typical local information retrieval task, the Fraenki movie information
system. We chose this way rather than immediately implementing a movie information
system to learn more about side effects that might come up when porting a system architecture
to a new application. Fraenki knows the program of all the movie theaters in
a restricted area (Middle Franconia, i.e. the greater Nuremberg area with approximately
1.5 million people). There are about 60 theaters in 35 locations with a total of about 350
performances shown per day. The system is hooked up to the public telephone network (+49
9131 16287 and +49 9131 6166116). Fraenki is a monolingual German system. The
vocabulary is about 1500 word forms. The program is updated weekly from the Internet.
New titles are added to the recognition lexicon in a semi-automatic way (see Section 6).
Typical initial user utterances which can be processed successfully range from

I want to go to the movies (will lead to system driven dialogue)

to 

I want to see ‘Pearl Harbor’ tonight in Erlangen at about eight 

(user driven dialogue, arbitrary combination of information slots) 



2.2 Stocki

Stocki is a stock information system which knows about stocks listed in the Dax30
and EuroStoxx50 (the German and European equivalents of the Dow Jones) and the
Nemax50 (the German equivalent of the Nasdaq). Stocki can answer questions about
information like stock ID, day’s high, day’s low and traded volume. It knows about
10 stock exchanges. The roughly 130 stocks are represented by about 250 variants like
‘Mercedes’, ‘Daimler’, ‘Chrysler’, and ‘DaimlerChrysler’ for the ‘DaimlerChrysler’
company. A typical question to the system is

What’s the current price and the traded volume of BMW in Frankfurt? 

Stocki is multilingual and can process inquiries and answer questions in German,
English and French. The information is retrieved from a financial service provider
either every n seconds or on the fly after the user inquiry. Currently Stocki is only a demonstrator
and cannot be accessed via the public phone.

3 WWW Database Access 

In [17] we showed that Evar accessed its information from several information sources
in the WWW like the German Railways (DB), Lufthansa, and Swiss Railways (SBB), together
with the facility for a number of local databases to be set up for regularly accessed
data. Evar gathered all the necessary information from the user and only accessed the
remote databases once (mostly for processing speed). While Fraenki has a local copy
of its remote database which is updated weekly, Stocki has to constantly access the
remote database to guarantee the most recent data. This situation necessitates a closer




Research Issues for the Next Generation Spoken Dialogue Systems Revisited

interaction with the information provider than a simple HTML parser for WWW pages,
which suffices for a research demonstrator (not to mention the legal aspects which have
to be cleared). Access speed and data security might make a different interface necessary
for a commercial system but, more importantly, the interfaces to the WWW databases have
to be defined, for instance via XML and respective document type definitions (DTD files)
and cascading style sheets (CSS). This allows a clean separation of ‘what’ is represented
on a WWW page (DTD) and ‘how’ it is represented (CSS). Otherwise nearly any layout
change in the WWW page means a change to the HTML parser for the database
access, which is an unacceptable amount of maintenance. Another important topic is
the VoiceXML [2] standard. VoiceXML is becoming the voice markup standard for
Interactive Voice Response (IVR) applications [14]. It allows standardized interfaces
between the different modules of spoken dialogue systems such as the recognizer, the
text-to-speech engine, and the dialogue manager. This will lead to an easier integration
of components from different vendors.
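
The maintenance argument for a structured interface can be made concrete with a small sketch. The XML document below is invented: the element and attribute names and all values are assumptions and do not describe any actual provider format. The point is only that the reader depends on element names, not on page layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical quote document; names and values are invented for illustration.
SAMPLE = """<quotes exchange="Frankfurt">
  <stock id="BMW"><price>38.20</price><volume>125000</volume></stock>
  <stock id="SAP"><price>160.50</price><volume>98000</volume></stock>
</quotes>"""


def read_quotes(xml_text):
    """Map stock id -> {'price': float, 'volume': int} from the XML text."""
    root = ET.fromstring(xml_text)
    return {
        stock.get("id"): {
            "price": float(stock.findtext("price")),
            "volume": int(stock.findtext("volume")),
        }
        for stock in root.iter("stock")
    }
```

A change to how the provider renders its pages would leave `read_quotes` untouched as long as the document structure stays the same, whereas an HTML scraper would typically have to be rewritten.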

4 Flexible and Adaptive Dialogue Strategy 

One of the distinctive traits of our dialogue systems is the possibility for the users to
freely formulate their queries and carry out the transaction quite flexibly. The user is
allowed to take the initiative regarding the order in which task parameter specification
takes place and is also usually able to change the current subgoal of the interaction,
e.g. by correcting a parameter that has already been dealt with, at a time when the
system is expecting information about another parameter. This contrasts with the more
common approach of presenting the user with menus with which they have to comply and
answer with yes or no. As a result, however, there are more possibilities regarding the
content of the next user utterance, thus increasing the probability of misrecognitions
and misunderstandings. To remedy this, we introduced a flexible strategy of implicit and
explicit confirmation [11].

In the case of misrecognitions, the island-driven chunk-parsing approach
used in the semantic analysis [16] can lead to the insertion of semantic units. The
system then tries to confirm a task parameter value which was never uttered by the
user. In [5] it is reported that users react much more sensitively to the wrong insertion
of task parameter values than to the deletion of uttered ones. Confidence measures as
described in [9] are needed to reject an utterance rather than confirm a wrongly detected
task parameter value. The tuning of when to reject an utterance is very difficult though,
since it will inevitably lead to a lower understanding rate (sometimes a correctly detected
task parameter value will be refused). Automatic detection of the optimal working point is
very difficult and an important research issue. Besides confidence measures, the explicit
handling of out-of-vocabulary and out-of-domain situations, as described in the next
section, can help considerably to increase the understanding rate.
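
The trade-off between rejection and confirmation can be illustrated with a minimal decision rule. The two thresholds and the three action names are placeholders invented for this sketch; the actual strategy is the one described in [11], and choosing the operating point is exactly the open tuning problem mentioned above.

```python
def confirmation_action(confidence, reject_below=0.3, explicit_below=0.7):
    """Choose how to treat a detected task parameter value.

    Thresholds are placeholders; a good operating point has to be tuned.
    """
    if confidence < reject_below:
        return "reject"      # ask the user to repeat instead of confirming
    if confidence < explicit_below:
        return "explicit"    # ask a yes/no question about the value
    return "implicit"        # repeat the value while asking the next question
```

Raising `reject_below` reduces the number of wrongly inserted values that reach the user, at the cost of refusing more correctly detected ones.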

5 Robustness towards Out-of-Vocabulary Words



System: Welcome at the stock information system from Sympalog -
which information do you want?

User: What is the current price of Coca-Cola?

System: Information about Coca-Cola is currently not available.
Do you need any other information?

(a) Example of Stocki handling an OOV stock name

System: Hi, this is Fraenki - which information do you want?

User: Hi, I want to see an action movie.

System: Sorry, but currently I am unable to provide you with information
concerning the quality, content or genre of the movie.
Do you need any other information?

(b) Example of Fraenki handling an OOD question

Fig. 1. Example dialogues involving OOV and OOD situations

One of the most important causes of failure in spoken dialogue systems is usually neglected:
the problem of words that are not covered by the system’s vocabulary (Out-Of-Vocabulary
or OOV words). In such a case, the word recognizer usually recognizes
one or more different words with an acoustic prohle similar to the unknown. These 
misrecognitions often result in possibly irreparable misunderstandings between the user 
and the system. In [13] we presented an approach to directly detect OOV words during 
the speech recognition process and, at the same time, to classify the word with respect 
to a set of semantic classes. This information was used to handle OOV words in the dia- 
logue [8]. In Fraenki and Stocki we extended this notion towards ‘out-of-domain’ 
(OOD) questions which are likely to occur even from a cooperative user: The user might 
ask for an information that is not in the database, like the content of a movie. These 
questions are modeled and the system informs the user that it cannot answer the ques- 
tion. Also, we included some predictable OOV words (like big German cities outside 
the region for which Fraenki has information) explicitly in the recognition lexicon. 
Thus Fraenki can recognize that the user wants to know the movie schedule of a city 
outside of its region either via the ‘OOV City’ word ([13]) or because the user asks for 
the schedule in a city that is close to the region like Regensburg or one of the major cities 
of Germany like Munich and Hamburg. Similarly Stocki recognizes 6 stock exchanges 
like New York and Tokyo, 6 indices like Nasdaq and about 50 stocks listed in other indices 
than the ones in the database like Microsoft and Exxon. Figure 1 shows excerpts from 
example dialogues for OOV and OOD situations. Of course we do not handle arbitrary 
OOD inquiries like ‘Can you bring me a Pizza?’, since we assume a cooperative user. 
The automatic detection of ‘cooperative’ OOD and OOV situations during held tests and 
in routine use of the system without huge amount of human inspection is an important 
research area. 
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6 Multilinguality 

Within the EC Copernicus-project SQEL, a multilingual (Czech, German, Slovak, and 
Slovenian) and multifunctional (airline and railroad connections) version of Evar was 
implemented [4]. Stocki can handle questions in German, English, and French using 
the same technology. Speech recognition and language identification are performed in one 
integrated process as described in [18]. The applications of Stocki and Fraenki have 
another multilingual problem, which is especially important for the update of the lexi- 
con: Many stock names and movie titles contain foreign, especially English words; often 
a movie has an English title and a German subtitle and users ask for the film using all 3 
possibilities (English title, German title, and both titles). We thus have the problem (typ- 
ical for many dialogue applications) that the recognition vocabulary changes regularly 
and contains proper names and acronyms from many languages which are expected to 
be pronounced by non-native speakers. The pronunciation e.g. of ‘IBM’ or ‘Carrefour’ 
depends to a large extent on the skill level of the user in the language of the names’ ori- 
gin. This problem has barely been touched in research concerning multilingual speech 
recognition, since practically all research concerns native speakers or non-natives from 
bilingual regions [3]. A lot of research needs to be done both on the level of grapheme- 
to-phoneme conversion and in speech synthesis to handle this difficult situation. For 
Fraenki new titles are retrieved from the internet source once a week, phonetized 
semi-manually by an expert, and spoken by the speaker whose voice is used for a con- 
catenation of prerecorded words and phrases as system output. This is acceptable since 
the movie schedules only change once a week and since the manual work normally only 
concerns a few new titles. However, merely switching to TV programs rather than movie 
theaters would make this approach unacceptable because of the huge amount of manual 
work, and a different approach would need to be taken (see [19] for our first experiments 
concerning the pronunciation of foreign words in the Fraenki scenario). Even though 
significantly more manual work is involved, current ‘off-the-shelf’ synthesis performs 
far too poorly to be an alternative to concatenation of prerecorded words and phrases in 
the Fraenki/Stocki scenario. 



7 Stochastic Methods for Semantic Analysis 

In [17] we argue that statistical methods need to be explored for semantic analysis. While 
we are still absolutely convinced that this is the long term way to go, current statisti- 
cal methods for semantic analysis require too much training data and don’t generalize 
enough across applications to be used for fast prototyping. One can imagine though a 
hybrid approach from stochastic methods as described in [15] and linguistic methods as 
described in [7] for semantic analysis: a stochastic module can score competing hypothe- 
ses from a knowledge based linguistic module and thus decide on the order in which 
the semantic hypotheses are processed. Another possibility that we are currently looking 
at is whether a stochastic module can be used across applications to detect uncooperative 
user utterances. Furthermore, it is an open and fascinating research topic whether it is 
possible to detect changes in the users’ behavior over time, using stochastic semantic 
analysis and unsupervised learning techniques on log files of running systems. Since the 
client of a dialogue system is always interested in the effectiveness of his system and 
since evaluation requires a lot of expensive manual work by IT experts, we believe that 
this is an important question. 
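
The hybrid approach mentioned above, in which a stochastic module decides the order in which hypotheses from a knowledge based module are processed, could look roughly like the following sketch. The hypothesis format (slot, value), the prior table and all names are hypothetical and only serve as a stand-in for a trained stochastic model.

```python
def order_hypotheses(hypotheses, score):
    """Return the hypotheses sorted from highest to lowest score.

    hypotheses: semantic hypotheses produced by a knowledge based parser.
    score: callable returning a probability-like number for one hypothesis.
    """
    return sorted(hypotheses, key=score, reverse=True)


# Toy stand-in for a stochastic model: made-up relative slot frequencies.
SLOT_PRIOR = {"city": 0.6, "movie_title": 0.3, "theater": 0.1}


def toy_score(hypothesis):
    """Score a (slot, value) hypothesis by the prior of its slot."""
    slot, _value = hypothesis
    return SLOT_PRIOR.get(slot, 0.0)
```

The linguistic module stays responsible for producing the hypotheses; the stochastic part only reorders them, so it can be replaced without touching the grammar.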

8 Integrated Recognition of Words and Boundaries 

In [17], we propose the direct integration of the classification of phrase boundaries into 
the word recognition process. HMMs are used to model phrase boundaries, which are 
also integrated into the stochastic language model. The word recognizer then determines 
the optimal sequence of words and boundaries. The approach is described in detail in 
[12] and is used in the semantic analysis of our systems: a word sequence containing 
a boundary is less likely to represent a task parameter value than one without. 

9 User Emotion 

In [17] we argued that it is important to identify a situation where the user gets angry in 
order to initiate an appropriate reaction, such as referring the customer to a human operator 
or starting a clarification sub-dialogue. This subject remains a topic of fundamental 
research and of growing interest [10]. In [6] it is shown in a WOZ-scenario that not only 
acoustic/prosodic parameters have to be exploited but dialogue structure/history as well; 
for instance, plain repetition of the last utterance vs. rephrasing can be an important cue 
to the detection of anger/frustration. 

10 Multimodality 

Stocki is a demonstrator system. When installed at a direct brokerage bank, issues 
like access control become important, especially if the functionality is extended to stock 
trading. Current voice based verification technology is far too error prone to be accepted 
by any bank, and PIN or password input via voice might be unacceptable in a public 
environment. Touch-tone (DTMF) based PIN input is an example where multimodal 
input is a necessity. The fact that a PIN might be spoken by a client driving a car but 
not be typed in via DTMF in that situation also demonstrates the necessity to flexibly 
offer several input modalities. If one thinks of unified messaging systems, then multimodal 
output is just as important. For instance, a user might ask for tasks like 
give me the current price of all the car manufacturers 
and fax them to my secretary. 

Similarly, a Fraenki user in the near future might want to see a preview of a movie, if 
s/he has a UMTS phone. We believe that speech will be only one integral part of future 
human-machine interaction, but that spoken dialogue will be the ‘central control’ of such 
an interaction (see [1] and [20] for a project on dialogue based multimodal 
human-technology interaction). 

11 Various Topics 

In this section we want to address some topics that came up during the implementation of 
the various systems that are important for future systems and were not mentioned in [17]. 

11.1 Barge In 

For a conversational system it is absolutely essential to have a sophisticated barge-in 
capability and robust noise cancellation. Especially in the Fraenki scenario we 
experience many calls from non-home/office environments, i.e. bars and public places 
with significant background noise. 

11.2 Rapid Prototyping 

Fraenki was built from scratch in two months using the existing recognition engine and 
dialogue manager. Thanks to the Sympalog toolkit that has since become available, other 
prototypes were implemented within a couple of days. Fast porting to new domains is very important 
for ‘real life’ systems and has enormous consequences especially for the methods in 
semantic interpretation and dialogue modeling; for instance methods that require a lot 
of hand encoding of linguistic knowledge for each lexical item are not feasible. 

11.3 Upscaling 

The ability to handle n calls in parallel is not a research issue but has large consequences 
for the system architecture and the use of resources. Questions which have to be addressed 
include timing issues (real time speech input, speech output, speech recognition, and 
database access) and redundancy against hardware failures. 

12 Conclusion 

In this paper we presented two state-of-the-art conversational dialogue systems. We took 
a second look at the research issues raised in [17]. It turns out that most of the research 
issues are still more than valid and that automatic learning methods lack robustness to- 
wards insufficient data to be used in rapid prototyping for new systems. Speech synthesis 
is an Achilles’ heel for telephony based dialogue systems, since it is THE interface to 
the user and since proper names and foreign words are not pronounced well by current 
synthesis systems. VoiceXML will be an important part of future conversational 
systems, and multimodality will become increasingly important with strong technological 
changes in mobile communication ahead of us. 

Abstract. The Internet offers users remote access to many information systems, 
independent of time and location. This paper describes an agent based approach to 
deal with issues that arise from the differences between user interfaces to functionally 
similar information systems. The approach yields agents that interact with 
user interfaces to extract the model of the underlying information system. The 
model is represented by labeling input fields of the interface with their meaning. 
The agents are modeled after human dialogue behavior and human computer 
interaction. The interaction agents are capable of independently and robustly querying 
the information system without explicit instructions. Two types of agents are 
implemented and tested on European sites of rail travel planners. The agent design 
and test results are reported. 



1 Introduction 

Information systems are often available independent of where a user is and what time 
it is. Remote access to such information services, independent of time and location, is 
often facilitated by information providers through offering their service via the Internet. 
Well known examples of such services are rail travel planners. Almost all European rail 
companies have made their schedules available on this medium. 

The web pages that give access to the rail planners all offer the same functionality, 
but all interfaces differ somewhat. There is no “de facto” standard for the look and 
feel of this kind of interface. These differences make offering information services in an 
alternative way through e.g. wireless devices or through a speech interface more difficult 
than necessary. Also creating a system where multiple similar services are integrated is 
hindered. 

This paper is about bridging that gap by using agent technology to assist with creating 
a unified interface to similar systems. The basic idea is that agents start an interaction 
with WWW sites. This dialogue aims to yield knowledge about the underlying 
structure. Our agent is modeled after human behavior (see [1,2]). 

2 User Interfaces 

Many information systems can be accessed in different ways. Some can be accessed as 
web pages on the Internet using a web browser, others can be accessed using a telephone, a 
usable voice interface might have been defined in voiceML. Recently the wireless markup 
language (WML) has become popular. Information systems can thus have multiple 
different interfaces that each use different means of communication. To make it possible 
to automate access to similar systems without having to use different specifications and 
procedures for each different interface, we need a way to describe what information an 
interface requires to perform its task. Such a user interface description language will 
be described in Section 6. With formal methods a system can be described using 
a precise and semantically sound language. Geert de Haan in [3] describes a model to 
design user interfaces based on the knowledge a competent user likely has. The model 
is called ETAG (Extended Task-Action Grammar). 

Describing the human computer interaction enables testing an implementation against 
its specification. Also functional requirements can be checked before implementation. 
Another reason for describing HCI is that it enables (formal) verification with e.g. tem- 
poral logic. 

Great effort has been put into formalizing the definition of user interfaces in the HCI 
field of research. This has enabled requirement testing and eased building correct interfaces. 
Also valuable work has been done on unifying interface description languages. An 
example is UIML (User Interface Markup Language). 



3 Rail Planners 

There are many services on the Internet that offer rail planners. In this section we will 
analyze these planners in order to make a classification based on the similarities and 
differences. An Internet rail planner is a web interface to a program with which train 
schedules can be searched. On top of this minimal service, many rail planners allow 
searching for the best connection between two locations or retrieving pricing information. 
A rail planner interface consists of several entry forms and result pages. All services use 
forms to request travel data such as departure station and travel time from a user. Most 
services have one of those pages. We’ll call these pages start pages. If a user forgets to 
specify a station, enters an invalid date or otherwise offers invalid or incomplete data, 
most planners present a page that shows what was wrong. Such pages usually contain a 
form with which the errors can be corrected. These pages are called intermediate pages. 
If all data was entered correctly, one or more schedule pages are shown. These pages 
hold the result of the query and are therefore called result pages. 

3.1 Visual Characteristics 

A rail travel planner helps a user create a travel schedule given a time and date, 
a departure station and an arrival station. Figure 1 shows one of the forms of the Dutch 
rail travel planner. The visual characteristics concern how forms look, the order 
of fields and which fields are used for data entry. We can consider these input fields as 
the expressions for the interaction dialogue. We found the following characteristics. 
Station selection can be done by pull down list or by free text box. Time can be entered 
in separate text fields for hour (24) and minute; in one text field, with hour and minute 
separated by a colon (:); or in two pull down menus for hours (24) and minutes (per 15 
minutes). Dates can be entered in separate fields for year, month and day, or in one field 
for these three values separated by a blank space, a dot (.) or a dash. In a left to right, top 
to bottom sense, many of the interfaces ask for the departure station first, then for the 
arrival station. Afterwards the date is asked for and finally the time can be entered. All 
planners except the Portuguese planner offer a toggle to select whether the 
specified time is a departure or an arrival time. The Portuguese planner does not allow 
a time to be specified at all. In almost all cases, date and time are pre-filled with 
the current date and time. This might give the user a hint as to what needs to 
be filled out where. All planners make use of labels: un-editable text fields or images that 
indicate what needs to be filled in a nearby input field. Image labels can contain icons 
(little images depicting a certain meaning) and text. Recognizing such images is well 
beyond the scope of this paper. Text labels contain valuable indicative information, but 
are language dependent and irregularly placed and will not be used. Some planners offer 
special aids, e.g. a “today” toggle so that no date needs to be entered. The Dutch planner 
even offers a “tomorrow” toggle. 

Fig. 1. Dutch rail travel planner 
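
The entry styles listed above can be made concrete with a short sketch that renders one moment in time in each style. The function names and dictionary keys are hypothetical, and the day-month-year order is an assumption, since the text does not state the order used by the planners.

```python
from datetime import datetime

SEPARATORS = {"blank": " ", "dot": ".", "dash": "-"}


def time_variants(moment):
    """Return the time of `moment` in the three entry styles listed above."""
    hour, minute = f"{moment.hour:02d}", f"{moment.minute:02d}"
    quarter = f"{moment.minute // 15 * 15:02d}"  # menus offer steps of 15 minutes
    return {
        "two text fields": (hour, minute),
        "one text field": f"{hour}:{minute}",
        "two pull down menus": (hour, quarter),
    }


def date_variants(moment):
    """Return the date as separate values and joined by each separator.

    Day-month-year order is an assumption made for this sketch.
    """
    parts = (f"{moment.day:02d}", f"{moment.month:02d}", f"{moment.year}")
    variants = {"separate fields": parts}
    for name, separator in SEPARATORS.items():
        variants[f"one field, {name}"] = separator.join(parts)
    return variants
```

An agent that does not know which style a form expects can try these variants one after another and observe which one the planner accepts.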



3.2 Technical Characteristics 

In the first web browsers, HTML forms were rather regular and well defined. But in 
order to meet user demands many additions to forms have been introduced. Because of 
this, mechanically interpreting forms has become a challenge. Different sorts of scripting 
extensions were added, the use of multiple submit buttons or alternatives to those buttons 
(e.g. images) was enabled, and sessions can now be used to easily divide forms over 
multiple pages. We have singled out a number of these additions to help us find out which 
techniques are needed to deal with HTML forms. Interacting with an information system 
is sometimes done more efficiently if the interaction is split up into several stages. In HTML 
different stages can be realized using multiple forms. None of the travel planners we 
checked uses sessions. After a user has filled out a form, he can submit it by pressing the 
submit button. Many interfaces now offer help with filling out forms. Access to this help 
is sometimes offered through additional submit buttons. These buttons do not actually 
“submit” the form, but are used to navigate. These navigating buttons are difficult to 
distinguish from the real submit button. They only differ in label and name, which are 
generally language dependent characteristics. 



4 Interface Knowledge 

When human users interact with a user interface they find out where to enter a date, how 
to enter a time etc. quite easily, even if they have not mastered the language used. We assume 
that this is because human users can recognize the layout of a page based on experience 
with other interfaces, and have common sense and knowledge of the information they 
need to enter. Users that possess the required skills and knowledge can learn how to deal 
with alternative interfaces to rail planners. Experience with other interfaces may have 
taught a user that form fields of a certain size usually indicate that the requested data 
has a length that suits it. One of our goals is to extract the meaning of the parts that an 
interface is made of, such as free entry fields, check boxes and pull down menus. 



5 SMART 

Our goal was to find the meaning of the HTML fields on an interface form. We can 
assume that this meaning cannot always be extracted from the form, e.g. in the case that 
labels and other explanatory material are given in an unknown language. To find out the 
meaning independent of the layout, coloring and labels of a form, we chose to interact 
with the interface. Such interaction is a process guided by rules. These rules allow us to 
assume certain meaning in a yet unknown interface. We can then test our assumptions 
and change them if necessary until we find one that is correct. SMART consists of an 
HTML Framework, a module for result evaluation, an agent and an interface between 
these three parts (see Figure 2). Section 8 discusses the agents, which contain 
the logic to decide how best to deal with an HTML interface. 




Fig. 2. Overview of the four parts in SMART 

6 Interface Structures 

To help extract the meaning of the parts, this section analyzes the knowledge that is 
needed. In order to use a web interface to a rail travel planner a user needs 
knowledge about traveling by train and skills for using web interfaces. To enable 
the use of this knowledge, a representation is needed in which it can be specified. This 
section introduces such a representation for start pages. Start pages are described using 
Structures. A Structure can either state exactly which input items a user interface has, 
or define which combination of Structures describes them. The former type of 
Structure is called Primitive, the latter is called Combined. 

With Primitive Structures we can specify which individual values have fields in a start 
page. Such a field could be a field for departure time, if the form has one field to enter 
time. It could also be a field for hour if the form has a separate field to enter a value for 
hour. 

Combined Structures can be used to specify which field combinations exist on a form. 
An example combination is “time”. To fill out the time in a form, a combination of two 
HTML fields for hour and minute is required. Such combinations can be specified as 
Combined Structures using the input and choice fields. In a Combined Structure it can 
be stated how many and which (sub-) Structures are expected. It can also be stated which 
alternatives are available. The HTML Framework can use this Combined Structure to 
search for the Primitive Structures for year and yearshort. The Primitive Structure for 
year was introduced in the previous paragraph as an input field with default value “2000”. 
The yearshort Structure could be similar but have “00” for a default value. Combined 
Structures can also be used to define combined values. A combined value is a value that 
is built up from more than one value. A form that requires such a combined value has 
more than one input field for the value. Time in many forms is a combined value: there 
can be separate fields for minute and hour, or one field for hour:minute and a selection 
for am/pm. 



6.1 Configurations 

A Structure defines which input is expected, while a Configuration shows how this input 
is expected. A Configuration is a list of Primitive Structures that each represent data that 
is to be entered somewhere in one field on an HTML form. Configurations are deduced 
from Structures by selecting and combining the options of the Structure. Creating a 
Configuration from a Combined Structure can be seen as a non-deterministic rewriting 
system that uses rules to rewrite the Combined Structure into a string of Primitive 
Structures. 
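
The rewriting view of Configurations can be illustrated with a small sketch that enumerates every string of Primitive Structures a Structure allows. The table format, the Structure names and the function name are assumptions made for this example and do not reproduce the notation of the HTML Framework.

```python
from itertools import product

# Hypothetical Structure table: a Combined Structure maps to a list of
# alternatives, each alternative being a sequence of sub-Structures.
# Names that do not appear as keys are treated as Primitive Structures.
STRUCTURES = {
    "date.time": [["date", "time"]],
    "time": [["hour", "minute"], ["hour:minute"]],
    "date": [["day", "month", "year"], ["day", "month", "yearshort"]],
}


def configurations(name, structures=STRUCTURES):
    """Rewrite a Structure into every list of Primitive Structures it allows."""
    if name not in structures:  # Primitive: nothing left to rewrite
        return [[name]]
    result = []
    for alternative in structures[name]:
        # Expand each part, then pick one expansion per part.
        expanded = [configurations(part, structures) for part in alternative]
        for combo in product(*expanded):
            result.append([prim for part in combo for prim in part])
    return result
```

With this table, `configurations("date.time")` yields four Configurations, one for each pairing of a date alternative with a time alternative.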



6.2 Assumptions 

After a Configuration is selected, we have explicitly stated how many fields we are 
expecting, what type of fields these are and what their default values are. The next step 
to take is to find a suitable HTML field on a form for each of the Configuration parts. 
A mapping of these parts onto the fields of the HTML form is called an Assumption. 

6.3 Leftovers 

An Assumption need not cover all HTML elements in a form. The elements that 
are left over after filling out a form using an Assumption are called Leftovers. Some 
interfaces may, however, require that all elements are filled out. Since the interface we 
are expecting should match our starting Structure, we can assume that Leftovers 
are irrelevant for the outcome, so we use default values for Leftovers. 

Summarizing, the steps to find out what fields on an HTML start page mean are the 
following: select a Start page, select a Start Structure, select a Configuration, select an 
Assumption, select a Leftover, submit the Start page and test the result. 
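
The Assumption and Leftover steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions: fields are identified by name only, field types are ignored, every Primitive Structure occurs at most once in a Configuration, and all function names are hypothetical.

```python
from itertools import permutations


def assumptions(configuration, form_fields):
    """Yield every mapping of Configuration parts onto distinct form fields.

    configuration: list of Primitive Structure names, e.g. ["hour", "minute"].
    form_fields: names of the input fields found on the HTML form.
    """
    for chosen in permutations(form_fields, len(configuration)):
        yield dict(zip(configuration, chosen))


def leftovers(assumption, form_defaults):
    """Return the fields an Assumption does not cover, with their defaults."""
    used = set(assumption.values())
    return {f: v for f, v in form_defaults.items() if f not in used}


def fill_form(assumption, values, form_defaults):
    """Build the field/value pairs to submit for one Assumption."""
    data = leftovers(assumption, form_defaults)  # Leftovers keep their defaults
    for part, field in assumption.items():
        data[field] = values[part]               # assumed meaning gets test value
    return data
```

For a form with three fields and the Configuration `["hour", "minute"]`, `assumptions` produces six candidate mappings; each one leaves a single Leftover field that is submitted with its default value.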

7 HTML Framework 

The HTML Framework was built in order to send an HTML form, extract forms from an 
HTML page, read Structures, rewrite Structures to Configurations, combine a Configuration 
and a form to an Assumption, combine an Assumption and a form to Leftovers, and fill 
out a form based on a Leftover and an Assumption. Our parser should be capable of parsing 
HTML forms. The Java 2 v1.3 SDK offers a callback based HTML 3.2 parser that was 
reused by rewriting the callback methods. The result evaluation module assumes that 
results are presented in tables. It also assumes that checks for the existence of a structure 
representation and the relative location of such representations are sufficient. Detailed 
information about the implementation is beyond the scope of this paper and will be 
published elsewhere. 
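
To give an impression of callback based form extraction, the following sketch uses the `html.parser` module from the Python standard library instead of the Java parser mentioned above. The class name, the function name and the dictionary layout of a field are assumptions for illustration only.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collect the input-like elements of every form on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of field dicts per <form>
        self._current = None  # fields of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            self._current.append({
                "tag": tag,
                "name": attrs.get("name"),
                "type": attrs.get("type", "text") if tag == "input" else tag,
                "value": attrs.get("value", ""),
            })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(html):
    """Return a list of forms, each a list of field descriptions."""
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms
```

The parser calls `handle_starttag` for every opening tag it meets, so overriding that one callback is enough to collect the fields that later serve as targets for Assumptions.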

8 Rail Planner Agents 

In this section interaction with an HTML interface will be used as a means of determining 
which HTML fields represent what. Agents are used because they have the advantage that 
they can do the interaction automatically. Two agents were designed. 

The brute force agent will check all possible ways to fill out an HTML form based 
on a Structure. The steps this agent takes are the following. First a start page is selected. 
Then all configurations are selected one at a time. For each configuration all Leftovers are 
selected, one at a time. Finally all Configurations/Assumptions/Leftover combinations 
are selected and the result is tested until a stop condition is reached. For a real life 
information system such as a rail planner this agent is not very efficient. But if there is 
a solution and the tests are defined well, the solution will be found. 
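
The exhaustive search of the brute force agent can be sketched as a pair of nested loops. This is a simplified illustration, not the agent itself: Leftovers always keep their default values, form submission and result testing are hidden behind a hypothetical `passes` callable, and all names are assumptions.

```python
from itertools import permutations


def brute_force_agent(configurations, form_defaults, test_values, passes):
    """Try every Configuration/Assumption pair until the result test passes.

    configurations: lists of Primitive Structure names to try, in order.
    form_defaults: mapping field name -> default value (used for Leftovers).
    test_values: mapping Primitive Structure name -> value to enter.
    passes: callable taking the submitted field/value pairs and returning
        True when the result looks as expected (the stop condition).
    Returns the first Assumption that passes, or None if none does.
    """
    fields = list(form_defaults)
    for configuration in configurations:
        for chosen in permutations(fields, len(configuration)):
            assumption = dict(zip(configuration, chosen))
            data = dict(form_defaults)            # Leftovers keep defaults
            for part, field in assumption.items():
                data[field] = test_values[part]
            if passes(data):
                return assumption
    return None
```

The number of candidates grows quickly with the number of form fields, which is why this strategy is complete but inefficient for real planners.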

The push agent also starts by checking all possibilities, but tosses away Assumptions that 
are unlikely to be correct and sorts Configurations so that those that match best are tested 
first. In the initialization phase, the HTML Framework is set up, the Structures are 
initialized, a start page is selected and some variables are set. The Structure that describes 
the expected interface is chosen and its Configurations are calculated by the Framework. 
A flow diagram of the push agent is displayed in Figure 3. 

Fig. 3. Flow diagram of the push agent 



9 Test 

A testing application has been defined that requests a date and time. The application 
is programmed as a Java Server Page (JSP). A JSP is a combination of HTML code and 
a Java program. The date is requested as two numbers; the time is also requested in 
two separate fields. The analysis of the test run shows that, after several startup 
messages, the agent indicates that it is searching for date.time. Afterwards it chooses the 
first Configuration that the HTML Framework offers for this Structure. It then calculates 
all possible Assumptions for the starting page. Each Assumption consists of field-value 
pairs. Then the Leftovers are constructed by the Framework. One of the Leftovers found 
consists of the fields year, month and minute. They are filled with an empty value. After 
fewer than 20 rounds of testing Assumptions and pushing down unlikely candidates, the 
tested Assumption passes all the tests. The agent concludes that time and date were found 
and the stop condition is reached. We tested our system on the Internet based rail travel 
planners from the Netherlands, Germany, the Czech Republic, Italy, Portugal, Denmark and 
France. We note that our test of date and time is language independent. The results show 
that our agent found the requested information without explicit knowledge of the languages 
of the 7 countries or of the structure of the Internet pages. 



10 Summary 

This paper described a system that extracts interface semantics by determining the 
meaning of the input fields of interfaces. In order to make the extraction of semantics 
feasible we have restricted our research to Internet based rail travel 
planners. A corpus of seven rail planners was created. We modeled human computer 
interaction based on the analysis of these planners. A notation that can be used to describe 
start pages based on their functionality was defined. A framework to use the start page 
description language to fill out a start page form and to retrieve the results was created. 
A testing module for testing result pages was implemented. Two example agents for 
extracting user interface semantics were developed. These agents are programmed as 
expert systems that have rules to take action based on intermediate and result pages. A 
testing environment was created. 

Abstract. In this paper we present an adaptive architecture for interaction and 
dialogue management in spoken dialogue applications. This architecture is targeted 
for applications that adapt to the situation and the user. We have implemented the 
architecture as part of our Jaspis speech application development framework. We 
also introduce some application issues discovered in applications built on top of 
Jaspis. 



1 Introduction 

Speech-based communication can differ greatly between individual users and situations. 
For example, some people prefer that the computer takes the initiative, but others may 
prefer a more user-initiated style of communication. Speech is also very language and 
culture dependent and differences between user groups can be large. In order to construct 
efficient speech applications we need interaction methods that adapt to the different users 
and situations. 

In this paper we present an advanced architecture for speech-based interaction and 
dialogue management. We use dialogue agents and dialogue evaluators for adaptive dia- 
logue handling. Together with input and presentation frameworks they form a foundation 
for adaptive interaction methods. Based on these components we can build adaptive and 
reusable interaction methods for speech applications. This architecture is included in the 
Jaspis speech application development framework which is described next. 

1.1 Jaspis Architecture 

Jaspis is a general speech application development architecture. It is based on Java and 
XML. Jaspis is a freely available framework to support the development and research of 
spoken language applications. Jaspis is described in detail in [4]. Here we focus on the 
features that allow for flexible interaction and dialogue management. A brief explanation 
of the key components of Jaspis follows. 

Jaspis contains two features that are needed in adaptive applications. First, all in- 
formation is stored in a shared place. This blackboard-type shared memory is called 
information storage. A conceptual, language independent form to present information is 
used whenever possible. This is a key feature to adapt applications to different situations. 
All system components store all information in the information storage. In this way each 
component can utilize all the information that the system contains. Using shared 
information it is possible to utilize information such as dialogue history and user profiles in 
every single component. 

The second main feature is the flexible interaction coordination model. This is re- 
alized using managers, agents and evaluators. The agents are system components that 
handle various interaction situations, such as speech output presentations and dialogue 
decisions. The evaluators attempt to choose the best possible agents to handle each 
situation and the managers coordinate these components. 

Next we will introduce the general interaction architecture followed by a more
detailed description of the dialogue handling components. After that we present some
application issues discovered in the applications written using the Jaspis architecture.
The paper ends with discussion and conclusions. 

2 Adaptive Interaction Management 

Speech-based dialogue can be divided into three parts. In a speech application we need
to receive inputs from a user, carry out actions according to the user’s inputs and present
the response back to the user. Therefore, we need three kinds of interaction handling
components. We must use some kind of input handling component to deal with the user
inputs, have a dialogue component to keep the conversation going in a meaningful
way and also have a component for output presentation.

In adaptive applications we can have competing interaction strategies for the same 
tasks and also complementing strategies for different tasks. Since the interaction is not 
necessarily based on sequential control, we must have a way to handle it in a more 
flexible way. All these needs call for some kind of coordination, selection and evaluation
of the different possibilities.

In Figure 1 we present our approach to adaptive interaction management. We use 
agents, evaluators and managers to represent the interaction techniques and their coor- 
dination. All three interaction sub-models consist of these components. The basic idea 
is similar for all models: interaction agents are used to present interaction techniques, 
evaluators are used to determine which agents are suitable for the different situations 
and managers are used for overall coordination. The only difference is that in the dia- 
logue and presentation models the selection of an interaction agent takes place before 
the agent handles the situation, but in the input model input evaluators operate after the 
input agents have operated on and ranked the results. Next we introduce how Jaspis 
implements the interaction coordination described above. 

2.1 Managers 

The managers control all interaction tasks. They can have any number of individual 
tasks, but they all share two common tasks: they use evaluators to determine how each 
interaction technique, i.e. an agent, suits the current interaction situation and then select 
one particular agent to take care of the situation. It is up to each manager to decide how 
it uses the results it gets from the evaluators. 

In the current Jaspis implementation managers also decide when they can handle the 
interaction on the top level (i.e., which manager should handle the situation). They use
evaluators for this task too. Basically, managers check if their agents are able to handle
the current situation and how successfully they are able to do it. 

The interaction manager controls the interaction on the highest level. It should be 
aware of the other components in the system and control how they carry out the inter- 
action. An application developer can specify exactly how the interaction management 
should be done. For example, we can use scripts to add domain specific behavior to the 
interaction manager. This kind of approach is utilized in the GALAXY-II architecture 
[3]. We prefer autonomous sub-components which determine themselves when they are 
able to handle the interaction. This allows for more flexibility and adaptability for the 
interaction. Jaspis includes a manager switching scheme where an opportunity to react 
to events is offered to managers based on their priorities. 
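
As an illustration of priority-based manager switching, the following Python sketch offers an event to managers in descending order of priority and hands control to the first one that reports it can handle the event. It is a minimal sketch and not the Jaspis implementation: the names Manager, can_handle and offer_event, the numeric priorities and the event strings are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Manager:
    """Hypothetical manager record: a name, a priority and a self-check."""
    name: str
    priority: int
    can_handle: Callable[[str], bool]


def offer_event(event: str, managers: List[Manager]) -> Optional[str]:
    """Offer the event to managers from highest to lowest priority.

    Returns the name of the first manager that accepts the event, or None
    if no manager is able to handle it.
    """
    for manager in sorted(managers, key=lambda m: m.priority, reverse=True):
        if manager.can_handle(event):
            return manager.name
    return None


# Assumed example configuration; priorities and events are illustrative only.
MANAGERS = [
    Manager("presentation", 1, lambda e: e == "output_ready"),
    Manager("dialogue", 2, lambda e: e in {"input_parsed", "output_ready"}),
    Manager("input", 3, lambda e: e == "speech_detected"),
]
```

In this sketch each manager decides for itself whether it can react, which mirrors the preference for autonomous sub-components over a central script.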

2.2 Interaction Agents 

The interaction agents implement the actual interaction techniques. Agents are often 
understood to be mobile, intelligent and autonomous. We do not use agents in that 
sense: they can be intelligent, and they are autonomous in a sense but also controlled in 
a way. Mobility is not an issue here. 

Agents are specialized for specific tasks, such as error handling or presentation of 
tables. The fact that agents are specialized makes it possible to implement reusable and 
extendable interaction components that are also easy to write and maintain. For example, 
we can write general interaction techniques, such as error correction methods, to take 
care of error situations in speech applications. Very simple and generic agents for error 
handling are ones that ask the user to repeat their input if it was not understood. This 
process actually requires three agents: one that determines that no input was recognized, 
one that decides that we should ask the user to repeat the input, and one that generates 
the actual output message. 

Although agents can be specialized for different tasks, we can also have different 
agents for the same tasks. In this way we can support different interaction strategies inside 
an application in a modular way. Because of this, an application can adapt dynamically 
to the user and the situation. For example, we can have different agents to take care of 
speech outputs spoken in different languages. We can take the agents from the example 
in the previous paragraph and write a presentation agent that now asks the user to repeat
the input using a different language. We now have the same error correction procedure
in a new language.

The set of attributes the agents have indicates what they can do and how well. For
example, we can have attributes for the languages that a presentation or an input agent 
can handle. Based on the agents’ attributes and information about the current situation 
and user, we can determine if an agent is suitable to handle the given situation. Some 
of the interaction agents may be totally application independent and some are closely 
tied to the application and/or to the domain. Application independent behavior is usually 
preferred so that the same components can be used in different applications. 



2.3 Evaluators 

A manager uses evaluators to evaluate all available agents and chooses the most suitable
one. The basic idea is that a manager uses evaluators to compare the different attributes of
agents and chooses the one recommended by the evaluators. Evaluations are separated into
small independent components (the evaluators) so that we can write reusable evaluators 
and keep the potentially complex evaluations manageable. 

The selection of the most suitable agent is based on the overall situation which 
may include things such as current inputs, dialogue context and user preferences. The 
evaluation can depend on the interaction task at hand. Different evaluators are specialized 
for evaluating different aspects. For example, one evaluator can evaluate some specific 
aspects of agents, such as the ability to process a certain input modality. Other evaluators
can deal with more general issues, such as whether a particular dialogue agent suits the overall
dialogue history. Evaluators can also be application independent or tied to the application 
domain. By using application independent evaluators, we can write reusable components 
and when necessary, new application specific evaluators can also be written or existing 
evaluators can be extended. 

When evaluators evaluate agents, each evaluator gives a score to each agent. Each 
agent therefore receives several scores. Multiplying the given scores together yields the 
final evaluation. As the scores are between 0 and 1, the final result also falls in this range. 
If an evaluator gives a “0” score to an agent it means that the agent is not capable of 
handling the situation and “1” means that the evaluator sees no reason against using the 
agent. This simple scheme can be extended to include scaling factors and more complex 
functions when needed. 
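
The scoring scheme described above can be sketched in a few lines of Python. Each evaluator returns a score between 0 and 1 for an agent, the scores are multiplied, and the agent with the highest product is selected. The evaluator functions, the agent attributes (handles, languages, style) and the example data are assumptions introduced for illustration; they are not part of the Jaspis API.

```python
from math import prod


def capability_evaluator(agent, situation):
    """0 if the agent cannot handle this kind of input, 1 otherwise."""
    return 1.0 if situation["input_type"] in agent["handles"] else 0.0


def language_evaluator(agent, situation):
    """0 if the agent does not support the current language, 1 otherwise."""
    return 1.0 if situation["language"] in agent["languages"] else 0.0


def style_evaluator(agent, situation):
    """Assumed soft preference: a mismatching style is possible but less suitable."""
    return 1.0 if agent["style"] == situation["preferred_style"] else 0.5


EVALUATORS = [capability_evaluator, language_evaluator, style_evaluator]


def final_score(agent, situation, evaluators=EVALUATORS):
    """Multiply all evaluator scores; a single 0 rules the agent out."""
    return prod(evaluate(agent, situation) for evaluate in evaluators)


def select_agent(agents, situation):
    """Return the name of the best-scoring agent, or None if all scores are 0."""
    best = max(agents, key=lambda a: final_score(a, situation))
    return best["name"] if final_score(best, situation) > 0 else None


# Hypothetical agents and situation used only to exercise the sketch.
AGENTS = [
    {"name": "repeat_en", "handles": {"no_input"}, "languages": {"en"}, "style": "system"},
    {"name": "repeat_fi", "handles": {"no_input"}, "languages": {"fi"}, "style": "system"},
    {"name": "help_en", "handles": {"no_input"}, "languages": {"en"}, "style": "mixed"},
]
SITUATION = {"input_type": "no_input", "language": "en", "preferred_style": "mixed"}
```

Because the scores are multiplied, any evaluator can veto an agent by returning 0, while values between 0 and 1 only lower its rank.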

3 Adaptive Dialogue Management 

The dialogue manager takes care of the overall communication between the user and the 
system. The input and presentation managers handle interaction tasks in detail. In most 
of the current speech systems dialogue management is handled by a single component 
called a dialogue manager. We believe that this monolithic approach is not ideal for speech-based
or other applications that are highly context sensitive. In order to utilize reusable
interaction components and to adapt to the user and the situation we need more flexible
methods for coordinating the dialogue.

3.1 Dialogue Agents

The dialogue agents form the basic communication units between the user and the 
application. Each time the dialogue manager takes control, a single dialogue agent is 
activated and it decides what action the system is to take. The next time the dialogue 
manager is activated the selection of an agent is repeated. Another agent can continue 
the dialogue in the next turn, but the idea of agents is that a single agent is specialized 
in a specific situation in dialogues. Therefore, a dialogue flow is a sequence of dialogue
units formed by the dialogue agents. As with agents in general, different dialogue agents 
are supposed to be suited for different situations. Each of them has a set of capabilities 
that reflects what this particular agent is suitable for. In a very simple case there is only 
one agent suitable for a given situation. In order to provide more flexibility and choices 
in the interaction we can add alternative agents with varying behaviors for the same 
dialogue situation. 

Dialogue management strategies have been widely studied and different kinds of
dialogue management approaches have been proposed and evaluated. For example,
according to Walker et al. [6] mixed-initiative dialogues are more efficient but not as preferred
as system-initiative dialogues. They argue that this is mainly because of the low learning
curve and predictability of system-initiative interfaces. However, system-initiative
interfaces are less efficient and could frustrate more experienced users. Thus, both kinds
of dialogue handling strategies are needed and should be used. Attempts to utilize both
mixed-initiative and system-initiative approaches have been carried out. For example, in
[1] short-cuts are used to provide the benefits of system-initiative and mixed-initiative
dialogue control strategies in a single interface.

From the viewpoint of an application developer, dialogue agents can also be used to
implement different dialogue control models. Different dialogue control models (such
as state-machines and forms) have different capabilities and benefits. In complex
applications we may need several control models. Approaches that utilize several dialogue
control models have been introduced, for example to combine state-based and form-based
dialogue control models. The dialogue management architecture of Jaspis provides
explicit support for alternative dialogue strategies, dialogue control models and reusable 
dialogue components. 

3.2 Dialogue Evaluators 

When the dialogue manager is selecting an appropriate agent for a situation, it uses 
dialogue evaluators to carry out the actual evaluation. Dialogue evaluators are specialized 
for different kinds of tasks. The simplest evaluator, the capability evaluator, just checks 
which dialogue agents can handle the current input. Often only one dialogue agent is 
found and the selection is trivial. However, in some cases there are many competing 
agents available. This is especially the case when input agents have indicated that an 
error has occurred and an error handling agent is needed. Since there exist agents with 
different kinds of error handling strategies we must find the most suitable one to solve 
the error situation and let it control the situation. 

Another example of a basic evaluator is the consistency evaluator. Since it is important
to keep some kind of consistency in the dialogue flow, we must check that the chosen
dialogue agents are coherent over time. This prevents the dialogue manager from switching
to different styles of agents at every dialogue turn. The consistency evaluator uses
the dialogue history to determine how well each agent suits the overall dialogue flow.
As changing the dialogue style is sometimes useful, the consistency evaluator should
allow complete changes in dialogue style, but it should not allow such changes to occur
all the time. However, this behavior can be application and domain dependent, and an
application developer should extend (inherit from) this evaluator and add application specific behavior
to it.
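
One possible reading of the consistency evaluator is sketched below in Python. An agent whose style matches the previous turn receives the full score; a style change is allowed but scored lower, and lower still if the style has already changed within the last few turns. The class and method names, the window length and the concrete score values are assumptions chosen for this example, and the subclass only illustrates how application specific behavior could be added by inheritance.

```python
class ConsistencyEvaluator:
    """Scores an agent's dialogue style against the recent dialogue history."""

    window = 3            # assumed: number of recent turns that are inspected
    switch_score = 0.7    # assumed: score for an occasional change of style
    unstable_score = 0.2  # assumed: score when the style already changed recently

    def score(self, agent_style, history):
        """history is a list of the styles used in earlier turns, oldest first."""
        if not history or agent_style == history[-1]:
            return 1.0
        recent = history[-self.window:]
        changed_recently = any(a != b for a, b in zip(recent, recent[1:]))
        return self.unstable_score if changed_recently else self.switch_score


class CautiousConsistencyEvaluator(ConsistencyEvaluator):
    """Hypothetical application specific variant that discourages switching more."""

    window = 5
    switch_score = 0.4
```

Neither variant ever returns 0 for a style change, so a complete change of dialogue style remains possible when other evaluators favor it.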



4 Dialogue Management in Practical Applications 

We have used Jaspis to construct several spoken dialogue applications in different do- 
mains. These applications include an e-mail application Mailman (Postimies) [5], a local 
transportation guide Busman (Bussimies) and a ubiquitous computing application
Doorman (Ovimies). Next, we introduce some experiences based on the implementations of
the mentioned applications.

In form-based dialogue management for database retrieval tasks, multiple agents 
give several advantages. Many of the special situations can be implemented in their own 
dialogue agents. This makes the main dialogue manager agent simple and maintainable. 
Basic form-based dialogue management can be implemented in a domain independent 
manner, so that all domain specific information is in configuration files. The basic
algorithms are reasonably simple. When dialogue management for special cases is
implemented in several special agents, most of these agents can also be domain independent.
In fact, even if the main dialogue agent is domain specific, the other agents can still be 
domain independent. Still, it is possible to develop domain specific agents for the special 
situations when domain knowledge is needed. 

Basic form-based dialogue management could be split into the following dialogue agents.
The main dialogue agent handles the situations when the system is responding to a legal 
user input that defines a query. This agent simply checks if a resulting form is a legal 
query. If so, it makes the query, otherwise it asks the user to give more information. An- 
other agent can take care of situations when we have received a result from the database. 
This agent informs the user of the results. If the query result set is too large to be pre- 
sented to the user, a special agent takes control and forms a question to the user to restrict 
the query. If the database result set is empty, we can have a special agent to modify the 
query. Ambiguity resolution can also be seen as a special agent. When user input is 
ambiguous, this agent steps in. The last three agents mentioned require some domain 
specific information, namely information about the query keys and concepts of the sys- 
tem. If one is developing an information retrieval system, the domain specific program 
code could be restricted to these three components. This way porting the system to a 
new domain would require updating only these pieces of program code. 
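
The division of labour described above can be sketched as a set of small Python functions, one per agent, plus a selection step that stands in for the evaluators. This is a minimal sketch under stated assumptions: the state dictionary, the agent function names, the returned action strings, the required query keys and the result-set limit are all hypothetical and only serve to make the example runnable.

```python
MAX_PRESENTABLE = 5             # assumed limit for a result set that can be presented
REQUIRED_KEYS = {"from", "to"}  # assumed query keys of a timetable-like domain


def main_agent(state):
    """Make the query if the form is a legal query, otherwise ask for more."""
    missing = REQUIRED_KEYS - set(state["form"])
    return "run_query" if not missing else "ask_for:" + ",".join(sorted(missing))


def result_agent(state):
    """Inform the user of the results."""
    return "present:%d" % len(state["results"])


def restrict_agent(state):
    """Result set too large: ask the user to restrict the query."""
    return "ask_to_restrict_query"


def empty_result_agent(state):
    """Result set empty: modify the query."""
    return "modify_query"


def ambiguity_agent(state):
    """User input was ambiguous: ask which reading was meant."""
    return "ask_to_disambiguate"


def select_agent(state):
    """Pick the single agent responsible for the current situation."""
    if state.get("ambiguous"):
        return ambiguity_agent
    results = state.get("results")
    if results is None:
        return main_agent
    if not results:
        return empty_result_agent
    if len(results) > MAX_PRESENTABLE:
        return restrict_agent
    return result_agent


def dialogue_turn(state):
    """Run one dialogue turn: select an agent and let it decide the action."""
    return select_agent(state)(state)
```

In this sketch only the restricting, empty-result and ambiguity agents would need domain knowledge in a real system, which matches the observation that porting could be limited to those components.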

Other, more special cases suitable for specific dialogue agents include different error
handling situations, requests for the user to repeat information, etc. If the user asks
the system to repeat something that was just uttered, there can be a completely different
dialogue agent to handle this. All we need is some kind of dialogue history where we
can find what the state of the dialogue was before, so that we can reproduce our output.
If the dialogue history is stored in a sufficiently generic form, the agent handling repeat
requests can be completely independent of the other agents. Error handling situations
can also be handled by specific agents. For example, in cases where the user informs the system that
something went wrong, e.g. speech recognition made an error, specific agents can take
care of the possible correction sub-dialogues. As far as it is possible to recognize user
utterances that are outside the application’s domain, they could be handled in their own
agent, or a set of agents.

When a state-based dialogue is implemented using multiple agents, there are several 
possibilities on how to split the task between them. One way is to implement a single 
agent for each state of the state machine. This way we can include complex processing 
into all of the states. Another way is to build a single generic state-based dialogue agent
that just reads the state transition definitions from a configuration file. In yet another
solution we can implement special states as separate agents and include the ones that do 
not require any complex processing into the single generic agent. In principle, we can 
also use multiple copies of a single agent but with different configurations, so that each 
state has its own instance of the same agent. This is a special way to produce the generic
state-based dialogue agent and the implementation could include very elegant program
code. If we want, we can also split a single state into multiple agents, so that different 
agents would handle the different types of inputs in each state. This would cause us to 
have very compact agents, but obviously a lot of them. 
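
The second option, a single generic state-based agent driven by configuration, can be sketched as follows. The JSON layout, the state names and the input labels are assumptions made for this example; an embedded string stands in for the configuration file, and the class is not part of Jaspis.

```python
import json

# Stand-in for the contents of a configuration file (assumed format).
CONFIG_TEXT = """
{
  "start": "ask_origin",
  "transitions": {
    "ask_origin":      {"place": "ask_destination"},
    "ask_destination": {"place": "confirm"},
    "confirm":         {"yes": "done", "no": "ask_origin"}
  }
}
"""


class StateDialogueAgent:
    """Generic agent: all dialogue logic lives in the transition table."""

    def __init__(self, config_text):
        config = json.loads(config_text)
        self.state = config["start"]
        self.transitions = config["transitions"]

    def handle(self, input_type):
        """Move to the next state; stay in the current state on unexpected input."""
        outgoing = self.transitions.get(self.state, {})
        self.state = outgoing.get(input_type, self.state)
        return self.state
```

States that need complex processing would not fit this table and could be implemented as separate agents, as described above.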

It is also possible to construct completely domain dependent dialogue management. 
When carefully designed, separation into different agents eases the development significantly.
Even if these components can seldom be reused without any modifications, the
agents themselves stay much smaller and therefore easier to develop and maintain. The 
main advantage is that when a well-designed application is further developed, each agent 
can be developed without changing the other ones. Expanding the system can also be 
easier since new functionality can be added by just adding new agents. 



5 Conclusions and Future Work 

In order to build adaptive speech applications we need adaptive interaction and dialogue 
handling methods. We have here presented an advanced model for interaction and di- 
alogue management to support adaptive speech applications. Our architecture includes 
adaptive interaction and dialogue management in the form of interaction and dialogue 
agents and evaluators. They make it possible to construct reusable interaction manage- 
ment components. Our solution has some common features with the work presented in 
[2]. In some sense our dialogue agents can be compared to handlers in Rudnicky and
Xu’s architecture. However, our solution focuses more on the adaptive and alternative
issues. It is also noteworthy that our architecture does not make assumptions on how the 
actual dialogue control strategies should be implemented. 

The presented interaction and dialogue models are implemented in the Jaspis archi- 
tecture and used successfully in various practical spoken dialogue applications which 
use different dialogue management strategies and control models. In the future we will 
extend our work to cover new application areas dealing with mobile and pervasive
computing. In particular we will extend our framework to cover issues such as multiple
simultaneous dialogues with multiple participants and the use of context and location
awareness.

Abstract. Web search has become an everyday activity for millions of people.
Unfortunately, there are a number of well-known problems associated with it. Two
of those problems are to express an information need as a set of search terms and 
to actually present only the most relevant matches for this query once the search 
has been performed. This paper tackles those problems for limited domains where 
the amount of data might be large, but the application of some fairly simple 
intelligent techniques to extract conceptual information is feasible. We describe 
how a dialogue system utilizes the extracted knowledge to guide the search. This 
paper builds on earlier work which focussed on the extraction of knowledge-rich 
indices. 



1 Introduction 

With more than a billion pages on the publicly indexable Web, nobody should be
surprised if a Web search engine returns thousands of matches for a simple query. How
should the search engine determine which of the matches are most relevant for that
particular user? However, the problem even arises for smaller domains. Our sample
domain, the University of Essex Web site, comprises fewer than 30,000 documents, yet there
is hardly any frequently submitted user query which results in fewer than fifty matching
documents, more likely in more than a thousand! Part of the problem is that the average
user query contains fewer than three words [8], in our sample domain even fewer than two.
One solution is to evaluate the initial query, consult some domain knowledge and initiate 
a dialogue that helps the user navigate to the set of most relevant documents. Obviously, 
this dialogue needs to be short and should only be started if necessary. Furthermore, the 
choice of how to get to the data must be left to the user. The dialogue must not dictate 
the next steps. 

For example, a user searching the Essex University Web site for union (a frequent 
query according to the log files) might not be aware of the fact that it is not easy to 
present the best matches since the query is highly ambiguous. In fact, there are Web 
pages in that domain presenting information about the trade union, students union and 
a Christian union. Not just that, but there are a number of pages devoted to discriminated
unions. Those could well be the pages the user expects as an answer if that user is a student
currently writing an assignment in the Distributed Computing module. 

We argue that a simple dialogue system which applies certain domain knowledge
can help the user to find the required information. Furthermore, we argue that such
a system-initiated dialogue is useful in domains as just described, e.g. sites of companies
or institutions as well as intranets, where the amount of data can be quite large, but not
as large as what is kept in databases of general Web search engines. We do not discuss
here whether it is feasible or even desirable to apply these ideas to general Web
search engines.

A simple way of extracting conceptual information, i.e. the domain knowledge to be
applied in the dialogue, was discussed in [6]. Unlike standard approaches, this method
does not rely on a given document structure or expensively created concept hierarchies. 

2 Related Work 

A comprehensive overview of related work was presented in [6]. This section is restricted
to some of the most relevant references only. 

A number of clustering and classification techniques have been applied to the Web. 
For example, clustering is being used for concept-based relevance feedback for Web 
information retrieval [1]. Following a user query the retrieved documents are organized 
into conceptual groups. Unlike in our approach, this structure is not extracted for the
indexed domain but for the search results. Grouper is an interface to a meta-search
engine which works in a similar way [9].

However, for our problem post-retrieval clustering would miss the point since a clas- 
sification structure for the whole data set will have to be built in advance. 

Unlike clustering, the idea of document classification is based on predefined
categories, normally manually constructed, and after a training phase new documents can
be classified on the fly, as in [2].

Northernlight¹ is an example of a search engine where the results are grouped in
dynamically created categories, so-called Custom Search Folders, which help a user
narrow down the search in case of too many matches. The search covers the whole
Internet and thus has access to much more data than what we assume. The hierarchy is not
acquired automatically from the data, but is maintained manually. As with any domain
independent approach, the results can be surprising. The user asking for “intelligent
Web search” is referred to a number of Custom Search Folders including “Shakespeare,
William”, “Blake, William” as well as “Museums & galleries”.

A recent trend is that search engines tend to offer a classified directory as an
alternative or additional source of information. An example is Google² which incorporates
the Open Directory³ hierarchy of classifications. Also ontologies and customized
versions of existing language resources like WordNet [7] are being successfully employed
to search product catalogues and other document collections held in relational databases
[5]. Part of that research is the actual construction of ontologies [3]. The cost to create the
resources can be enormous and it is difficult to apply these solutions to other domains
where the document structure is not known in advance. However, for broad domains
such general knowledge sources can be good. It might be argued that a domain like
the one discussed in this paper is fairly broad. But experience shows that this is not the
case, as the example of the word acquisition demonstrates. There are more than 600
occurrences in the entire document collection, but only five terms that the automatic
extraction process considered conceptually (most closely) related: language_acquisition,
first_language, research_group, second_language, psycholinguistics. This is quite different
from what a domain independent knowledge source like WordNet would associate
with the word acquisition.

¹ http://www.northernlight.com
² http://www.google.com
³ http://dmoz.org

The Ypa is a system that addresses a problem similar to the one described here, where a user
is looking for advertisers that could provide certain goods or services [4]. The documents
are not Web pages but advertisements from BT’s Yellow Pages and Talking Pages⁴. The
Ypa and the system presented here share some requirements on the dialogue, namely
that the interaction with the user needs to be as short and concise as possible.

3 The Domain Knowledge 

The automatic acquisition of a directory structure which functions as a domain model
was described in [6]. The main motivations are summarized here. A classification
hierarchy is imposed on a set of Web pages by extracting concept terms based on the
structure detected in documents and relations between documents. The basic idea is
to select keywords for documents which are found in more than one of the following
contexts: meta information, document headings, document title or emphasised parts
of the document (e.g. bold or italic font). Any other text can be ignored altogether.
This allows us to separate important terms from less important ones. The approach is largely
language independent because it exploits markup more than anything else. Furthermore it
is cost effective. The extracted concepts function as classifications not just for individual
documents but for directories and linked documents as well, similar to those classifications
in manually maintained Web directories like Yahoo!⁵. Another important issue is the
relation between concept terms. Basically, we consider two concepts to be related if
there is a document for which both terms were found to be conceptual terms.
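
The two rules just stated, selecting terms that occur in more than one markup context and relating concepts that co-occur in a document, can be sketched in Python as follows. The sketch assumes that each document has already been reduced to a mapping from context name to the set of terms found there; that representation, the function names and the example documents are assumptions and do not reproduce the extraction process of [6].

```python
from collections import Counter, defaultdict
from itertools import combinations


def concept_terms(doc):
    """Return the terms found in at least two of the document's markup contexts.

    doc maps a context name (meta, title, headings, emphasis) to a set of terms.
    """
    counts = Counter(term for terms in doc.values() for term in set(terms))
    return {term for term, n in counts.items() if n >= 2}


def related_concepts(docs):
    """Relate two concepts if both are concept terms of the same document."""
    related = defaultdict(set)
    for doc in docs.values():
        for a, b in combinations(sorted(concept_terms(doc)), 2):
            related[a].add(b)
            related[b].add(a)
    return related


# Hypothetical documents, already split into markup contexts.
DOCS = {
    "d1": {"meta": {"union", "students_union"}, "title": {"students_union"},
           "headings": {"union", "bar"}, "emphasis": set()},
    "d2": {"meta": {"union"}, "title": {"trade_union"},
           "headings": {"trade_union", "union"}, "emphasis": {"pay"}},
}
```

Terms such as bar or pay, which appear in only one context, are ignored, so the relation is built over a much smaller vocabulary than the full index.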

4 The Dialogue System 

We will demonstrate how knowledge created as outlined above can be utilized in ap- 
plications like information retrieval, document clustering or classification. We concen- 
trate on information retrieval, more specifically, the retrieval of documents following 
a user query submitted via a dialogue system. Before explaining the employment of 
the extracted structures, it is worth listing informally some of the requirements for such 
a dialogue system. The dialogue with the user must: 

- allow the user to add or relax constraints if necessary

- finally present the best matching documents or report that there are no matches

- be as short as possible (in terms of dialogue steps)

- present the user a number of choices to continue the dialogue if this is appropriate

- present the best possible choices only

- avoid unnecessary dialogue steps

⁴ Yellow Pages® and Talking Pages® are registered trade marks of British Telecommunications
plc in the United Kingdom.

⁵ http://www.yahoo.com

This leads to the idea of clustering the documents using concept terms. Simply 
speaking, a cluster is the set of documents described by a number of concepts extracted 
from the source data. This way the automatically created structures are applied in two 
ways: first to construct clusters offline, and second to guide a user in a search task so that 
useful document clusters can be constructed on-the-fly leading to a cluster that describes 
the resulting document set. Quite different from standard information retrieval, this can
be seen as finding the right set of constraints, i.e. classifications, that match the user’s need
and answer set. Rather than operating on the set of index terms, it is the set of more
significant and more reliable concept terms that is being used to find answers. Of course,
once a query is specific enough the terms to be operated on will be ordinary index terms.

Because of the usual problem of too many matches, one can easily construct
numerous examples where the idea of searching documents for a user query in a top-down
fashion (i.e. via classifications) seems more appropriate than just ranking the results
retrieved by matching the user query against database entries. Consider a Yellow Pages request
for alarms. Possibly relevant classifications as found in the Colchester directory are: (1)
Alarm systems, (2) Burglar alarms & security systems, (3) Car alarms & security, (4)
Fire alarms, (5) Gas alarms and (6) Intruder alarms. With or without knowing the explicit
classifications it is not clear which documents will be most relevant. In this specific case
a single dialogue step could establish which of those classifications the user is after. To
make the point more general, even if there are no explicit classifications like the ones
just listed, it is desirable to create them automatically.

In other words, standard information retrieval can be seen as a bottom-up (data 
driven) strategy of matching query terms to document indexes, but in contrast to that 
the extracted structure will allow a top-down (dialogue driven) search strategy starting 
at the top level of general classifications going down to the document level as depicted
in the example in Figure 1. The top clusters in the hierarchy represent single concept
terms, clusters further down represent a number of concept terms that describe a smaller
set of documents. This example gives only a motivation; in reality there are a number of
overlapping hierarchies rather than just one.

The figure gives only a partial picture. So far we have been concerned about the
classifications in a directory. But part of a typical classified directory are cross-references
relating classifications to each other, which goes beyond the scope of this paper.

The implementation of the above mentioned top-down processing of a query is 
fairly straightforward. We will concentrate on the most interesting task, to decide which 
constraining terms should be offered if a query results in too many matches. Figure 2 
depicts this problem. We assume the query to be a set of terms. Let the current query 
describe the set of documents called Cluster 1 which is a subset of our domain. The 
next step is to determine discriminating terms that retrieve subsets of Cluster 1. Related 
concepts are used to do this. All concepts related to any of the current query terms 
are considered as candidates. As a result the dialogue step would involve displaying 
the “best” concepts and the user will choose one or perform some other possible option. 
Because we are only dealing with concepts and not keywords in general, this is a feasible 
task that can partially be performed offline. 











Fig. 1. Two Search Strategies 




Fig. 2. Clustering the Potential Results 



What are the “best” concepts? The example in Figure 2 shows exactly four potential 
concept terms applicable to the current query. Each of the sets referred to as Cluster 
2 ... 5 is the set of documents defined by adding exactly one of the potential concept 
terms to the current query. Now we consider all those concepts which result in non-empty 
document sets. The created document clusters can obviously overlap. In 
the introductory example the terms students_union and christian_union are such a case. 
Furthermore, the partitioning of the document set is normally not exhaustive, because the 
number of concepts is fairly restricted. The dialogue should help the user find the actual 
documents he or she is interested in. It seems more intuitive to present some possible 
ways to constrain the query, preferably picking good discriminators, rather than giving 
as many choices as there are documents. From what was said so far, the concept terms 
representing each of the clusters in the example would be possible user choices. However, 
we do not want Cluster 4 and Cluster 5 to be presented side by side. Instead, the larger 
document set is selected as a choice alongside the option to constrain it even further. 
Note that user and administrator have some freedom in setting the maximum number 
of possible choices. The dialogue system will offer the largest clusters only. Once a user 
has made a choice this process starts again. 

This is a simplified explanation which ignores other options available to the user. 
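
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the system: the function name, the data layout (document sets keyed by concept term) and the reading that two clusters are not shown side by side when one is contained in the other are all hypothetical.

```python
def choose_constraints(query_docs, concept_docs, max_choices=4):
    """Pick constraining concepts for a query that matched too many documents.

    query_docs   -- set of document ids matched by the current query (Cluster 1)
    concept_docs -- dict mapping each candidate concept to the set of
                    documents indexed under it (hypothetical layout)
    """
    # The cluster of a candidate is what remains after adding that concept.
    clusters = {c: query_docs & docs for c, docs in concept_docs.items()}
    # Only concepts that still retrieve something are of interest.
    clusters = {c: d for c, d in clusters.items() if d}
    # Largest clusters first; ties are broken alphabetically.
    ranked = sorted(clusters, key=lambda c: (-len(clusters[c]), c))

    chosen = []
    for concept in ranked:
        # Assumption: a cluster contained in an already chosen, larger
        # cluster is not offered next to it; it stays reachable in a
        # later dialogue step.
        if any(clusters[concept] <= clusters[kept] for kept in chosen):
            continue
        chosen.append(concept)
    return chosen[:max_choices]


if __name__ == "__main__":
    docs = set(range(1, 11))
    candidates = {
        "a": {1, 2, 3, 4, 20},
        "b": {3, 4, 5},
        "c": {6, 7, 8},
        "d": {6, 7},   # contained in the cluster of "c"
        "e": {99},     # empty once intersected with the query
    }
    print(choose_constraints(docs, candidates))
```

Because the document sets per concept do not depend on the user, they can be computed ahead of time, which matches the remark that part of this task can be performed offline.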



5 Examples 

The database of index tables is only useful if it can at all be applied in a realistic 
situation. Using some examples, this section describes the dialogue system that exploits 
the knowledge-rich indices automatically created for the Essex University domain. Not 
all user queries actually require the dialogue system. If the query can be matched against 
a reasonably sized set of documents, then no dialogue is necessary. Obviously, reasonably 
sized depends entirely on what administrator and user define it to be. 



Fig. 3. Concept tree for example query “yahoo” (root: yahoo; offered concepts: search_engine, information_service, altavista, excite) 



To illustrate the dialogue we pick two of the most frequent queries according to 
the Web server’s log file. We use a setup that initiates a dialogue if a query matches 
more than 50 documents. The most frequent query was (possibly surprisingly so) one 
unrelated to the university: “yahoo”. Figure 3 displays the concepts related to the query 
which are being offered to the user as possible constraints. Of course, the user is free to 
add any other term in the input field provided. Also this figure only displays the choice of 
concepts and not the default choices (e.g. show most significant matches for the current 
query, call the standard search engine at this stage etc.). There is a total of 97 documents 
matching the keyword yahoo, which is why in this setup a dialogue is initiated. The 
number of displayed addresses using the constraining concept terms is as follows: 32 
(adding search_engine), 12 (information_service), 15 (altavista), and 12 (excite). 

As the second example from the top ten most frequent queries we shall pick the 
user request for a “prospectus” as displayed in Figure 4. This is another example that 
displays the significant differences between a domain-independent world model and the 
one extracted automatically. Displayed are the five most relevant concepts (more than 
five related concepts exist, and a different setup would produce more choices). Any 
option the user chooses actually either results in a set of documents to be displayed or 
another dialogue step with more constraining choices as shown in the figure. Note that 
no manual modification was applied to the examples. 

Fig. 4. Concept tree for example query “prospectus” (root: prospectus; first-level concepts: request, student_life, undergraduate_degree, postgraduate_study, research_study; second-level concepts shown: request_form, mail_post, research_study, postgraduate_study, e_prospectus) 



6 Recent Experiences 

The sample domain is the entire University of Essex Web site. Currently there are 27,333 
indexed pages. This excludes all those pages that are not in text format. The search system 
has been fully implemented and is now being evaluated. One interesting fact is that the 
dialogue facility does not slow down the system significantly. Many of the computations 
can be performed offline, that is the number of related concepts that exist for a given 
word, their relevance, and even the tree described by a concept, i.e. the exemplified 
dialogues can be performed for all concept terms beforehand. For a limited domain that 
is feasible, it would not be the case for the entire Web. Another fact can be detected by 
analyzing the log files in our sample domain: the users hardly ever use anything else 
than a noun as input. The majority of queries is between one and two words long which 
causes our system to initiate a dialogue step in most cases. 

Abstract. Recently, the technology for speech recognition and language processing 
has improved, and speech recognition systems and dialogue systems have been 
developed for practical use. However, for such systems to become practical, 
not only these fundamental techniques but also techniques for portability and 
expansibility must be developed. 

We have so far studied usability and robustness, but not yet expansibility 
and portability. The purpose of this paper is to improve the efficiency of 
system construction, focusing on the portability of spoken dialogue systems, by 
clearly separating the task-independent and task-dependent parts of the system. 
Applying the system to two tasks, sightseeing guidance and literature retrieval, 
we evaluated the efficiency of system construction. On the sightseeing guidance 
task, we obtained performance equivalent to that of the previous task-dependent 
system, which was built without regard to portability. 



1 Introduction 

Recently, much research has been done on the robustness and reliability of spoken 
dialogue systems. We developed a “Mt. Fuji sightseeing guidance” system which used 
touch screen input, speech input/output and graphical output, and have improved the 
sub-modules of a speech recognizer, natural language interpreter, response generator 
and multi-modal interface [1,2, 3, 4, 5]. However, all of these modules except for the 
speech recognition module depended on a given task or domain. 

As speech recognition systems are increasingly used in practical applications, 
spoken dialogue systems will also become more widespread. However, the cost 
of developing a new spoken dialogue system is enormous. The systems that have been 
developed so far cannot be transferred to other domains easily, so a highly portable 
system that can be easily adapted to another domain or task urgently needs to be 
developed. There are several examples of research that focused on high portability and 
expansibility [6,7,8,9,10,11]. 

In [6], a prototype could be constructed simply, even for a complicated speech dialogue 
system, using the PIA system, which was implemented using Visual Basic. This system 
placed priority on achieving high robustness of speech recognition and high naturalness 
of generated dialogue. However, the system limited the task to the domain of knowledge 
search. The system produced experimentally by the REWARD (Real World 
Applications of Robust Dialogue) project [7] allowed the development and debugging of the 
system to be controlled by the developer. This system attempted to implement a spoken 
dialogue system through a telephone line. With the CSLU Toolkit [8], which was 
developed by CSLU (Center for Spoken Language Understanding) of OGI, developers 
can construct a dialogue application using speech even if they do not 
have any knowledge of speech dialogue systems at all. Furthermore, modules such 
as natural language understanding, speech synthesis and animation of face images as 
well as speech recognition could be easily constructed. However, the parsing capability 
is limited in that, for example, the components cannot handle complicated grammar. 
The CSLU system was developed to provide classroom training on speech 
processing. M. Sasajima et al. also proposed a new framework for developing spoken 
dialogue systems [9]. In this framework, dialogue control is described by a unification-based 
script handling instruction set language, which is similar to PASCAL. A. Abella 
and A. L. Gorin proposed a systematic approach for creating a dialogue management 
system based on a Construct Algebra, a collection of relations and operations on a task 
representation [10]. E. Levin et al. reported the design and implementation of the AT&T 
Communicator mixed-initiative spoken dialog system [11]. The Communicator project, 
sponsored by DARPA, was launched in 1999. 

We have also considered the portability of spoken dialogue 
systems [12]. Our experience showed that it was difficult to apply the existing 
Mt. Fuji sightseeing guidance system to East Mikawa sightseeing 
guidance. 

In this paper, our purpose is to build GUI-based development tools for portable 
spoken dialogue systems [12]. In particular, we focus on development tools for spoken dialogue 
systems for database retrieval. We have clearly separated the semantics, retrieval and response 
modules into task independent parts and task dependent parts. Furthermore, we 
propose techniques for building the task dependent parts automatically and efficiently. The 
performance and efficiency of task adaptation have been evaluated on a Mt. Fuji 
sightseeing guidance system and a literature retrieval system. 

2 Highly-Portable System for Information Retrieval 
from Database 

The dialogue processing part of the proposed system consists of a semantic interpreter, 
a retrieval module, a response generator and a dialogue manager which integrates the three 
modules. The system overview is shown in Figure 1. 

Each module has the following role: 

- Speech Recognizer: The module recognizes the user utterance from speech input and 
generates a recognized sentence. We used SPOJUS [13] as this module. 

- Semantic Interpreter: The module interprets the recognized sentence and 
generates a semantic representation. 

- Retrieval Module: The module extracts words as retrieval keys from the obtained 
semantic representation and translates them into SQL. 

- Response Generator: The module is activated by the dialogue manager, selects one 
of the response strategies and generates a response sentence. 

Fig. 1. System Overview (labels recoverable from the figure: system and data of domain independent; data of domain dependent; Speech, Recognized, Semantic, Retrieval, Response, Synthesized) 



- Text-to-speech Module: The module synthesizes speech signals for the response 
sentence and plays back the audio files to the user. 
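
As an illustration of the Retrieval Module's role, the sketch below turns a set of retrieval keys into a parameterized SQL query. It is a minimal example under stated assumptions: the flat field-to-value form of the semantic representation, the table layout and the function name are hypothetical, and Python's built-in sqlite3 stands in for the PostgreSQL back end so that the example is self-contained.

```python
import sqlite3


def build_query(table, keys):
    """Translate retrieval keys (field -> value) into a parameterized SELECT.

    Table and field names are assumed to come from the developer's field
    definitions, never from the user, so only the values are bound as
    parameters.
    """
    fields = sorted(keys)  # fixed order keeps the generated SQL reproducible
    sql = f"SELECT name, price FROM {table}"
    if fields:
        sql += " WHERE " + " AND ".join(f"{field} = ?" for field in fields)
    return sql, [keys[field] for field in fields]


# Hypothetical table layout; the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facility (name TEXT, kind TEXT, area TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO facility VALUES (?, ?, ?, ?)",
    [
        ("Lakeside Hotel", "Hotel", "Kawaguchi Lake", 12000),
        ("Yamagishi Inn", "Inn", "Kawaguchi Lake", 10000),
        ("Pension Crayon", "Pension", "Kawaguchi Lake", 10000),
        ("Example Hotel", "Hotel", "Elsewhere", 9000),  # made-up row
    ],
)

# Assumed (hypothetical) semantic representation of a user request.
semantic = {"kind": "Hotel", "area": "Kawaguchi Lake"}
sql, params = build_query("facility", semantic)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(f"{len(rows)} facilities were found.")
```

Binding the values as parameters keeps user-supplied words out of the SQL text itself, which is a general precaution rather than something the paper describes.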

In this paper, we focus on three modules: the Semantic Interpreter, the Retrieval 
Module and the Response Generator. We believe that the Speech Recognizer (except for its 
language models) and the Text-to-speech Module are task independent. A task adaptation technique 
for the language models of the speech recognizer was preliminarily evaluated in [12]. 

3 Task Independency and Dependency of System 

We have clearly separated the semantics, retrieval and response modules into task 
independent parts and task dependent parts. The system core was built so as to remain 
completely task independent. 

3.1 System Core 

We used Chasen [14] for morphological analysis and PostgreSQL [15] as the database 
management system for retrieval. We constructed the semantics modules based on the existing 
Mt. Fuji sightseeing guidance system [5], separating the task dependent and task independent 
parts. All parts of the system core are clearly task independent. 

3.2 Data Sets 

The data sets consist of task independent data sets and task dependent data sets. The result 
of this separation is shown in Table 1. 

4 Task Adaptation 

Figure 2 shows the flow chart of task adaptation. 










376 S. Kogure and S. Nakagawa 



Table 1. Separation of task dependent data sets and task independent data sets 

Data 

Task Independent    morphological dictionary except for noun and verb, syntactic 
                    grammar, noun and verb semantic dictionary for dialogue processing 

Task Dependent      morphological dictionary for noun and verb, noun and verb semantic 
                    dictionary for retrieval processing, convert rule from semantic 
                    representation to retrieval pattern, field information of 
                    database, database, display format of retrieval result 




Fig. 2. Task Adaptation 



We consider what kind of data may be prepared as task-dependent knowledge when 
the task is applied. In the proposed framework, the application developer prepares the 
following: 

- A generally usable database (machine readable) 

- The format information of each field of the database 

- A corpus of user utterances (dialogue examples) 

The database information retrieval system first requires a database (generally, one 
that is open to the public) as a retrieval target. The format information of each field in 
the database should be defined in order to access the database. In addition, since the 
dictionary and language model are adapted to the database information retrieval system 
in the task, a corpus of the user utterances is required. 

For example, Figure 3 shows a generally usable database and the format information 
of each field of the database in a literature retrieval system. In Figure 3, the paper 
information consists of the title of the paper, the author(s) of the paper (family name and last 
name), the keyword(s) of the paper, and so on. The transfer format information describes how 
each item of paper information is processed. In Figure 3, the keywords ’spoken language 
understanding’, ’word spotting’, ’language model’ and ’heuristic search’ are substituted 
into the array variable ’keyword’. 



A generally usable database 

TITL: Word Spotting in Spontaneous Speech With Heuristic Language Model 
AUTH: Kawahara, Tatsuya / Munetsugu, Toshihiko / Doshita, Shuji 
KYWD: spoken language understanding / word spotting / language model / 
heuristic search 

The format information of each field of the database 
Database format information 

S|TITL:|\$|title 
P|AUTH:|,|/|\$|faminame, lastname 
P|KYWD:|/|/|\$|keyword 

Transfer format information 

DATABASE::: 
ref: $ref_id{int}, $tite{varchar(200)}, $cite{varchar(200)}, $year{int}, 
$abste{varchar(800)} 
aut: $aut_id{int}, $ref_id{int}, $farae[]{varchar(50)}, 
$lase[]{varchar(50)} 
key: $key_id{int}, $ref_id{int}, $keyworde[]{varchar(50)} 
DATABASE::: 

Fig. 3. An example of paper information and the definition given by a developer 
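
To make the processing of one record concrete, the following Python sketch splits the paper information of Figure 3 into a title, a list of authors and the array variable keyword. It is a hand-written illustration, not the interpreter of the format definitions: the splitting rules ('/' between items, ',' between family name and last name) are read off the example, and the function name and the returned dictionary layout are assumptions.

```python
RECORD = """\
TITL: Word Spotting in Spontaneous Speech With Heuristic Language Model
AUTH: Kawahara, Tatsuya / Munetsugu, Toshihiko / Doshita, Shuji
KYWD: spoken language understanding / word spotting / language model /
heuristic search
"""


def parse_record(text):
    """Split one bibliographic record into title, authors and keywords."""
    fields = {}
    tag = None
    for line in text.strip().splitlines():
        line = line.strip()
        head = line[:5]
        # A field starts with a four-letter upper-case tag and a colon.
        if len(head) == 5 and head[:4].isupper() and head[4] == ":":
            tag = head[:4]
            fields[tag] = line[5:].strip()
        elif tag is not None:
            # Anything else continues the previous field (see KYWD above).
            fields[tag] += " " + line

    authors = []
    for person in fields.get("AUTH", "").split("/"):
        if person.strip():
            family, _, last = person.partition(",")
            authors.append((family.strip(), last.strip()))

    keyword = [k.strip() for k in fields.get("KYWD", "").split("/") if k.strip()]
    return {"title": fields.get("TITL", ""), "authors": authors, "keyword": keyword}


paper = parse_record(RECORD)
print(paper["keyword"])
```

Each list element would then become one row of the corresponding table (for instance one keyword row per entry), which is presumably why the database definition marks these variables as arrays.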



5 Application to Mt. Fuji Sightseeing Guidance System 

The Mt. Fuji sightseeing guidance system and the literature retrieval system 
were chosen as evaluation tasks. The system was applied to these tasks, and we 
investigated the efficiency of adaptation and the performance of the system. An example of 
using the system is shown in Figure 4. 

Preparing the task information took 2 hours, generating the morphological dictionary 
took 1 hour, generating the semantic dictionary took 10 hours, building the retrieval database 
took 1 hour and constructing the response format took 1 hour. Constructing the 
task dependent data sets thus took 15 hours in total. 

When the previous (task-dependent) Mt. Fuji sightseeing guidance system [5] 
was adapted to the East Mikawa sightseeing guidance task, 
the task adaptation took 84 hours. Compared with this, the proposed system 
greatly shortens the development period. 

The 515 user utterances that were collected in the previous evaluation [5] were 
input into the system. Each input sentence was given as text rather than speech, because our purpose 
is to evaluate the portability of the system. 

The semantic understanding rate and the response generation rate were 79.4% and 
68.0%, respectively. For the 106 of 515 sentences on which semantic understanding 
failed, the rejection rate (for example, “Please say it again”) was 86.8%. Table 2 shows 
the efficiency and performance of the system. 
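
The figures reported above can be cross-checked with a few lines of arithmetic. The sketch below only recombines numbers stated in the text. The absolute counts of understood and rejected sentences are not given in the paper and are derived here from the percentages, so they should be read as assumptions.

```python
# Hours spent on each adaptation step for the Mt. Fuji task (from the text).
hours = {
    "task information": 2,
    "morphological dictionary": 1,
    "semantic dictionary": 10,
    "retrieval database": 1,
    "response format": 1,
}
total_hours = sum(hours.values())
speedup = 84 / total_hours  # relative to the previous 84-hour adaptation

utterances = 515
failed = 106
understood = utterances - failed  # derived, not stated in the paper
understanding_rate = 100 * understood / utterances

# Derived count of failed sentences that were rejected ("Please say it again").
rejected = round(0.868 * failed)
rejection_rate = 100 * rejected / failed

print(f"total adaptation time: {total_hours} hours")
print(f"reduction factor: {speedup:.1f}")
print(f"semantic understanding rate: {understanding_rate:.1f}%")
print(f"rejection rate: {rejection_rate:.1f}% ({rejected} of {failed})")
```

Under these assumptions the reported percentages are mutually consistent, and the proposed adaptation is between five and six times faster than the previous one.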

===== Speech Input ===== 
Input: What accommodations are there around Kawaguchi Lake? <=== input utterance 
     : What accommodations are there in Kawaguchi Lake? <=== recognized sentence 

===== System Output ===== 
7 facilities were found. 

===== Retrieval Result ===== 
No.1 Lakeside Hotel (Hotel), in Kawaguchi Lake, 12000 yen 
No.2 Rayground Hotel (Hotel), in Kawaguchi Lake, 12000 yen 
No.3 Kawaguchiko Hotel (Hotel), in Kawaguchi Lake, 12000 yen 
No.4 Kawaguchiko Ground Hotel (Hotel), in Kawaguchi Lake, 12000 yen 
No.5 Yamagishi Inn (Inn), in Kawaguchi Lake, 10000 yen 
No.6 Pension Lausanne (Pension), in Kawaguchi Lake, 10000 yen 
No.7 Pension Crayon (Pension), in Kawaguchi Lake, 10000 yen 

===== System Output ===== 
The retrieval result is the above. 

Fig. 4. Dialogue example (Mt. Fuji sightseeing guidance system). 



Table 2. Efficiency and performance of previous and proposed systems 

Evaluation criteria                             Mt. Fuji           Mt. Fuji           Literature 
Efficiency                                      previous adapt.    proposed adapt.    proposed adapt. 
[hour.man]                                      84                 15                 17 

                                                Mt. Fuji           Mt. Fuji           Literature 
Performance (semantic                           previous           proposed           proposed 
understanding rate [%])                         79.4               79.4               80.5 



6 Application to Literature Retrieval System 

We evaluated application to a literature retrieval system as well as the Mt. Fuji sightseeing 
guidance system. 

An example of using the system is shown in Figure 5. When a list of papers has been 
retrieved and the user inputs an utterance such as “Please display detailed 
information of the second paper.”, the system displays detailed information, such as the 
abstract, for the second paper in the list. 

Preparing the task information took 2 hours, generating the morphological dictionary 
took 2 hours, generating the semantic dictionary took 10 hours, building the retrieval database 
took 2 hours and constructing the response format took 1 hour. Constructing the 
task dependent data sets thus took 17 hours in total. 

We collected 82 utterances from 6 subjects (male). In order to collect these 
utterances, we used a WOZ (Wizard of Oz) system that substitutes for the semantic interpreter, 
response generator and dialogue manager. The unchanged retrieval module was used for 
retrieval. 

The collected 82 utterances were input into the system. The semantic understanding 
rate and the response generation rate were 80.5% and 79.3%, respectively. For the 16 
of 82 sentences on which semantic understanding failed, the rejection rate was 62.5%. 
For 6 utterances the rejection failed, because one of the retrieval keys could not be 
extracted from an utterance containing several retrieval keys. Table 2 shows the efficiency 
and performance of the system. 

===== System Output ===== 
This is a document retrieval system. Please input retrieval conditions. 

===== Speech Input ===== 
Is there a paper on the multi modal? <=== input utterance 
input: There is a paper on the multi modal. <=== recognized sentence 

===== System Output ===== 
23 papers were found. 
Please input additional retrieval conditions. 

===== Speech Input ===== 
This is related to Internet. 
input: This is related to Internet. 

===== System Output ===== 
3 paper were found. 

===== Retrieval Result ===== 
No.1 A. Nakashima, et al.: "Intelligent network for personal move 
communication", Institute of Electronics, Information and 
Communication Engineers Journal of Japan, 1995) 
No.2 K. Ono, et al.: "Development of new generation communication 
network", Institute of Electronics, Information and Communication 
Engineers Journal of Japan, 1995) 
No.3 H. Also, et al.: "Future prospects of information highway", 
Institute of Electronics, Information and Communication Engineers 
Journal of Japan, 1995) 

===== System Output ===== 
The retrieval result is the above. 

Fig. 5. Dialogue example (Literature Retrieval System). 



7 Summary 

We constructed a development tool for portable spoken dialogue systems while building 
the Mt. Fuji sightseeing guidance system and the literature retrieval system. 

In the Mt. Fuji sightseeing guidance system, we evaluated the efficiency and 
performance of the system. The high portability of system development was demonstrated: 
constructing the system took considerably fewer hours (15) than adapting the previous 
task-dependent system (84). We also constructed the literature retrieval system, and the 
development period (17 hours) was about the same as for the Mt. Fuji system. In both 
prototype systems, the semantic understanding rate was about 80%. 

In the near future, we will construct a system with a highly portable dialogue 
manager. 

Abstract. In this paper we describe the extension of the EasyDial developer 
interface of Dialogos, the spoken dialog system of Loquendo. EasyDial has been 
integrated with an XML-based representation of data structures and procedural 
knowledge which are dependent on the application domain. A set of XSLT 
programs translates this knowledge into textual data and C language procedures which 
are integrated in Dialogos after the processing by EasyDial. The adoption of an 
XML-based representation for declaring the domain dependent knowledge speeds 
up the application development process significantly. 



1 Introduction 

The increasing demand for spoken dialog systems in different domains, ranging from 
providing information about trains or flights to personal services for accessing e-mail by 
telephone and to answering machines, is making more and more apparent the necessity 
for a rapid development of new applications, not necessarily involving the experts who 
implemented the dialog system shell. 

In order to fulfill this requirement, the dialog system shell must be encapsulated 
in a further level of abstraction which hides the details concerning data structures and 
centralizes the knowledge shared by the modules of the system. 

In this paper, we describe an approach to domain knowledge representation in Dialo- 
gos, the LOQUENDO spoken dialog system [1], and EasyDial, the developer toolkit for 
Dialogos [4], which aims at facilitating the development of spoken language applications 
in the telecommunication domain. 

This approach is based on an XML based representation of the knowledge to be 
acquired by the system in order to provide its service; in particular, this knowledge 
concerns the structure of the sentences to be generated. Based on the XML representation, 
an XSLT program produces both the data in the internal representation format of the 
system and a relevant portion of the C language procedures to be used by the Dialogos 
engine in a given application. 

XML has been chosen to define languages for representing the system knowledge 
since: 
1. Its rapid diffusion as a standard ensures that it will be easier to find developers able 
to use it. 

2. The precise definition of the syntax of documents via Document Type Definitions 
prevents errors during editing, while XML editors facilitate the editing by suggesting 
options and by signalling errors. 

3. Its portability ensures the availability of programs for processing it on all platforms, 
both for producing XML documents and for operating on it (e.g., XSLT translators). 

4. The automatic translation of XML documents in programming language procedures 
restricts the possibility of errors by the developers. In fact, the procedures are rep- 
resented at a higher level of detail, via languages which are defined by means of 
XML. 

5. The abstraction from implementation details and the rigorous DTD declarations 
make XML documents easily adaptable to new versions of the dialog system. 

In the following, we will describe examples of use of XML in four cases: the repre- 
sentation of system parameters, the generation of dialogic turns, the generation of natural 
language phrases that describe the value of parameters, and the partially automated gen- 
eration of procedures for managing parameters. 



2 The Parameter Definition 

The parameters of Dialogos are knowledge structures which contain data which must 
be acquired by the system to provide the service. 

For example, in order to build a query to a database of flights, at least the cities 
of departure and arrival and some information about the date of flight are required. 
So DEPARTURE_CITY, ARRIVAL_CITY, and DEPARTURE_DATE are parameters 
in Dialogos’ terms, and must be acquired by the system in order to provide the caller 
with an answer. Parameters may have atomic values (for example “Rome”) or complex 
values obtained from the meaningful composition of a set of atomic values (for example, 
“the eleventh of November”). 

The XML definition of parameters makes it possible to gather information about items 
which is often spread across different files. In particular, the XML definition specifies 
the structure of each parameter (e.g., which parameters it subsumes or which parameters 
it is composed of), its type, and the information about the initiative of the system 
towards it: whether it must be requested or confirmed. Finally, the information about 
the compatibility of the possible atomic sub-parameters is specified as well (see next 
Section). 

The parameter definitions not only centralize the information about the 
parameters themselves, but also constitute the source for the generation (by an appropriate 
XSLT program) of the data structures in C language which are used by Dialogos, and 
which previously had to be written by hand. In particular, the generation process exploits 
the knowledge about the type of parameters, and the subsumption information (i.e., the 
same C structure is shared by more than one parameter). 

3 The Generation of Sentences 

As described in [3], the generation of sentences in Dialogos is organised in two levels: 
Dialog Moves (DMs, like Request for asking information, Verify for checking the system 
understanding, etc.) and Dialog Acts (DAs). 

DAs contain the knowledge about the form of the sentences which must be generated 
to perform a certain Dialog Move. They depend on the set of parameters (e.g., which are 
needed and which have already been acquired), and show a large variability. For example, 
a Verify DM concerning the departure city of a flight differs from one concerning the 
departure time: 

1 Vuole partire da Torino? 

(Do you want to leave from Turin?) 

2 Vuole partire alle quattro del pomeriggio? 

(Do you want to leave at four o’clock?) 

Moreover, the same parameter (say DEPARTURE_TIME) can be expressed in different 
ways according to what the caller said: 

3 Vuole partire fra le tre e le cinque? 

(Do you want to leave between three and five o’clock?) 

Finally, the form of the same DA may vary according to the context; e.g., within the 
context of a misunderstanding [2]: 

4 Mi scusi, non ho capito. Vuole partire fra le tre e le cinque? 

(Sorry, I did not understand. Do you want to leave between three and five?) 

When the developing shell of Dialogos, EasyDial, was not supported by the XML 
module, such knowledge took the form of a long list of templates including all possible 
combinations of parameters and contextual flags. For example, the fourth DA listed 
above was represented as: 

PARAMETER: TIME_INTERVAL 
CONTEXTUAL_FLAG: NOT_UNDERSTOOD 
PATTERN: Mi scusi, non ho capito. Vuole ~ ? 

where the first attribute denotes the parameter to be verified, the second one identifies 
the current state of the context (this DA should be used in case the previous caller’s turn 
was not understood properly by the system) and the third one the pattern to be generated 
(provided that the gap be filled with the value of the parameter, see next Section). 

The verbs ‘partire’ (‘leave’) and ‘arrivare’ (‘arrive’) (which are missing from the 
template) are generated together with the value of the parameter, depending on the 
parameter to which the template applies (departure or arrival) and on the presence of 
a contextual flag (e.g., in case of the destination city of a return flight, the verb takes the 
form of ‘ritornare’, ‘to go back’). 

The XML based representation of Dialog Acts captures the above generalities and 
provides a more compact representation. An XSLT program is used to generate auto- 
matically all possible combinations of parameters and contextual flags. The output of 
the XSLT program is in the human-readable format which currently constitutes the input 
of EasyDial, so that the development system need not be modified and the generated 
Dialog Acts can be checked and refined by hand. 

The DAs which generate sentences 1-3 above (with many others) can be captured 
together by a single XML construct; it takes the following (somewhat simplified) form: 

<DA type="VERIFY" NOT_UND="YES"> 
<PARAMETER name="TIME"/> 
<PATTERN>Vuole ~ ?</PATTERN> 
</DA> 

Since the parameter TIME can take different forms according to the availability of 
its component parameters, different DAs will be generated, among which the one above, 
where only the TIME_INTERVAL was involved. 

Since not all the combinations of component parameters make sense (e.g., HOUR, 
the hour of departure, is not compatible with TIME_INTERV: “*Vuole partire alle cinque, 
fra le tre e le sei?”, “*Do you want to leave at five, between three and six?”), the illegal 
combinations of component parameters are contained in the XML parameter definition 
described in the previous section, which can be accessed by the XSLT program when 
generating the alternatives. 
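
The expansion of one compact construct into many templates can be illustrated outside XSLT. The Python sketch below enumerates the combinations of component parameters, discards the illegal ones and pairs each remaining combination with every contextual flag. It is not the XSLT program of the system: the component DAY_PART, the way combinations are named and the dictionary layout are hypothetical, while HOUR, TIME_INTERVAL, NOT_UNDERSTOOD and the two patterns come from the examples in the text.

```python
from itertools import combinations

# Component parameters of TIME; DAY_PART is a hypothetical third component.
COMPONENTS = ["HOUR", "TIME_INTERVAL", "DAY_PART"]

# Pairs that must never occur together (HOUR with an interval, as in the text).
INCOMPATIBLE = {frozenset({"HOUR", "TIME_INTERVAL"})}

# Pattern per contextual flag; None stands for "no flag set".
PATTERNS = {
    None: "Vuole ~ ?",
    "NOT_UNDERSTOOD": "Mi scusi, non ho capito. Vuole ~ ?",
}


def legal_subsets(components, incompatible):
    """Yield every non-empty combination without an incompatible pair."""
    for size in range(1, len(components) + 1):
        for subset in combinations(components, size):
            pairs = (frozenset(pair) for pair in combinations(subset, 2))
            if not any(pair in incompatible for pair in pairs):
                yield subset


def expand_dialog_acts():
    """Return one template per legal combination and contextual flag."""
    acts = []
    for subset in legal_subsets(COMPONENTS, INCOMPATIBLE):
        for flag, pattern in PATTERNS.items():
            acts.append(
                {
                    "PARAMETER": "+".join(subset),
                    "CONTEXTUAL_FLAG": flag,
                    "PATTERN": pattern,
                }
            )
    return acts


if __name__ == "__main__":
    for act in expand_dialog_acts():
        print(act)
```

Even with three components and two flags the list already has ten entries, which suggests why maintaining such templates by hand becomes tedious as parameters are added.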



4 The Generation of Parameters 

As discussed in the previous section, the generation of DAs does not cope with the 
problem of generating the values of the parameters. 

This task is accomplished by a set of procedures which, in the previous version of 
the system, had to be written by hand in C. 

In order to automate the task, we used XML to define a simple programming 
language: a DTD has been built for representing, at a higher level of abstraction, the 
instructions necessary for generating the values of parameters. In this way, one XML 
document contains the rules for generating the value of each parameter (and of its 
component parameters). 

As a short example, consider the representation corresponding to a C switch construct 
for transforming a number into the name of the corresponding weekday. 

<ITEM> 
  <PAR NAME="WEEK_DAY"/> 
  <CASE><NUMCONST VALUE="1"/><STRING TEXT="lunedi'"/> 
  </CASE> 
  <CASE><NUMCONST VALUE="2"/><STRING TEXT="martedi'"/> 
  </CASE> 
</ITEM> 

The automatic translation via XSLT of this construct accounts for the tasks of check- 
ing if WEEK_DAY is already bound, retrieving its value and assigning the output string 
to the appropriate variable (see Figure 2 for an example XSLT rule). 

Note that in the hand-written functions all these tasks, together with the handling 
of possible failures in case of missing values, had to be coded anew for each parameter. 
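
To make the translation step concrete, the following sketch turns the ITEM construct above into C-like text. It is an illustration in Python, not the authors' XSLT stylesheet. The C names exist_param, get_value_param, strgen and local_flag are hypothetical stand-ins loosely modelled on Figure 2.

```python
import xml.etree.ElementTree as ET

ITEM_XML = """
<ITEM>
  <PAR NAME="WEEK_DAY"/>
  <CASE><NUMCONST VALUE="1"/><STRING TEXT="lunedi'"/></CASE>
  <CASE><NUMCONST VALUE="2"/><STRING TEXT="martedi'"/></CASE>
</ITEM>
"""

def item_to_c(item_xml):
    """Emit C-like code that checks the parameter, reads it and selects a string."""
    item = ET.fromstring(item_xml)
    name = item.find("PAR").get("NAME")
    lines = [
        f"if (exist_param({name})) {{",               # is the parameter bound?
        f"    switch (get_value_param({name})) {{",   # retrieve its value
    ]
    for case in item.findall("CASE"):
        value = case.find("NUMCONST").get("VALUE")
        text = case.find("STRING").get("TEXT")
        # Assign the output string for this case.
        lines.append(f'    case {value}: strcpy(strgen, "{text}"); break;')
    # Failure handling for a missing value (hypothetical flag name).
    lines += ["    }", "} else {", "    local_flag = false;", "}"]
    return "\n".join(lines)

C_CODE = item_to_c(ITEM_XML)
print(C_CODE)
```

The point of the sketch is that the existence check, the value retrieval and the failure branch are emitted once by the translator instead of being rewritten for every parameter.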
Since XML syntax may seem clumsier to some people than that of standard programming 
languages, it should be noted that the developer has an XML editor at his disposal. In 
the research project, we are using Xeena (developed by AlphaWorks)*, which checks 
conformance to the DTD and suggests the syntax of the generation 
rules to the developer (see Figure 1 for a snapshot of the Xeena editor). 

5 Functions for the Parameters 

Besides the functions for generating natural language descriptions of the value of the 
parameters, the application developer currently has to write a number of functions for 
checking the consistency of parameters and for transforming their values or for retriev- 
ing them from the database or from the application. Some typical examples are the 
consistency control that Dialogos applies for validating the recognition results of time 
expressions, and the ones applied for checking that the return date is after the departure 
date acquired in a previous dialog turn. An automatic generation of these procedures is 
more problematic than the description of the natural language form of parameters, since 
they are more application-dependent: currently, they must be written by a developer who 
has a deep knowledge of the structure of the database and of the underlying application. 

Instead of developing an XML-based language for describing this knowledge at a 
higher level, we have chosen a semi-automatic approach. Starting from the knowledge 
about the parameters (see Section 2), a skeleton of each function is generated in an 
automatic way. This approach is inspired by the JavaBeans methodology for generating 
the 'interface' of Java classes. 

In fact, a number of functions are structured in a similar way and are associated with 
the parameters depending on their structure. 

For example, the prototypical function for checking the internal consistency of a 
parameter composed of simpler ones (like the parameter TIME described above, which 
is composed of HOUR, TIME_INTERV, PART_OF_DAY) is structured in the following 
way: 

1. declare local variables; 

2. check the existence and retrieve the value of component parameters; 

3. perform a domain-dependent consistency check on the possible combinations of 
component parameters; 

4. return the result. 

While task 3 is complex and differs for each parameter, the procedures for 
executing the remaining ones can be generated automatically from the structure 
of the parameter and its type, which (as we saw in Section 2) are stored in an XML 
document. 
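
The four-step structure lends itself to a skeleton generator, sketched here in Python under stated assumptions. The generated C function name, the helper calls exist_param and get_value_param, and the fixed buffer size are hypothetical; only the four steps and the component names come from the text.

```python
def consistency_skeleton(param, components):
    """Generate the text of a C function skeleton for a composite parameter."""
    lines = [f"int check_{param.lower()}(void)", "{"]
    # Step 1: declare local variables, one flag and one buffer per component.
    for comp in components:
        c = comp.lower()
        lines.append(f"    int has_{c};")
        lines.append(f"    char {c}[50];")
    # Step 2: check existence and retrieve the value of each component.
    for comp in components:
        c = comp.lower()
        lines.append(f"    has_{c} = exist_param({comp});")
        lines.append(f"    if (has_{c}) get_value_param({comp}, {c});")
    # Step 3: left to the developer, since it is domain dependent.
    lines.append("    /* TODO: domain-dependent consistency check */")
    # Step 4: return the result (placeholder value).
    lines.append("    return 1;")
    lines.append("}")
    return "\n".join(lines)

SKELETON = consistency_skeleton("TIME", ["HOUR", "TIME_INTERV", "PART_OF_DAY"])
print(SKELETON)
```

Only the TODO line has to be filled in by hand, which mirrors the semi-automatic division of labour described above.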

Even if the conceptual part of the application developer's work must be done 
case by case, having the structure of the functions already in place is of great 
help. Moreover, the automated generation of functions reduces the possibility of errors 
or missing configurations of parameters, which would otherwise have to be detected by 
means of runtime debugging. 

* The Xeena editor of IBM Alphaworks can be found at 
http://www.alphaworks.ibm.com/tech/xeena. 
Fig. 1. The Xeena XML-editor displaying generation rules. 



6 Conclusion 

In this paper, we have described the work on the EasyDial developer interface for Di- 
alogos. We have defined via XML some languages which are used as a representation 
formalism for storing both knowledge about data structures and procedural knowledge. 
This knowledge is then directly processed by EasyDial and integrated in Dialogos with- 
out any further intervention of the application developer. 
<xsl:template match="PAR"> 
  if (exist_param_gps(<xsl:apply-templates select="@NAME"/>)) 
  <xsl:if test="following-sibling::CASE"> 
    char string<xsl:number level="multiple"/>[50] = ""; 
    string<xsl:number level="multiple"/> = 
      get_value_param_gps(<xsl:apply-templates select="@NAME"/>); 
    switch(string<xsl:number level="multiple"/>) { 
    <!-- switch on the value of the local variable --> 
  </xsl:if> 
  <xsl:apply-templates select="following-sibling::CASE/STRING"/> 
  <xsl:if test="following-sibling::ITEM[position()=last()]"> 
    if (strgen == "") { 
    <xsl:apply-templates select="following-sibling::ITEM"/> } </xsl:if> 
  <xsl:if test="not(following-sibling::ITEM[position()=last()])"> 
    if (strgen == "") {local-flag=false;} </xsl:if> 
  else { <xsl:apply-templates select="following-sibling::OBL"/> }; 
</xsl:template> 

Fig. 2. An example of an XSLT rule for translating generation rules. Text in italics corresponds to 
actual C code. Some pieces of it are generated or not according to XSLT conditionals. For 
instance, the input XML specifications can include constructs such as 'CASE'; in correspondence 
to it, a C 'switch' will be generated. The 'xsl:apply-templates' tag takes care of calling recursively 
the XSLT rules on the rest of the XML document. 

In this task, XSLT has proven to be a flexible and complete instrument for translating 
an XML document into data structures and C procedures, even if it was originally 
developed for dealing with document styles. 
Abstract. This paper describes the design of a dialogue-based integrated 
development environment. It discusses the basic idea of the system: grammar-based 
dialogue generation. Generating dialogues by means of a grammar allows 
the system to be nearly independent of the programming language and to limit 
syntactic errors in the generated code. The rest of the paper is devoted to 
techniques that make dialogue-based programming more effective. 



1 Introduction 

Dialogue systems are nowadays applied in many areas. Information retrieval dialogue 
systems are one example of such an application (see e.g. [7]). This paper deals with 
another application of dialogue systems: dialogue-based programming. The term 
dialogue-based programming was introduced in [6]. The author proposed a new programming 
language intended especially for blind users. A spoken dialogue system allowing 
programming in this new language was proposed in the same paper as well. 
A preliminary version of this system was implemented later (see e.g. [1],[2]). 

The system described in this paper has a slightly different goal. It provides a dialogue 
programming environment for blind programmers for several widely used programming 
languages, at least for C/C++ and Java. The code name of the project is AudiC. The system 
is speech oriented, i.e. no visual feedback is required. It is fully controlled by speech 
and keyboard commands. Combinations of synthesized speech, sampled speech and 
environmental sounds (see e.g. [3] or [4]) are used for communication with the 
user. The system communicates in the Czech language. The function of the system is 
illustrated in Table 1. 

One of the principal properties of the system is that it provides a uniform environment 
for different programming languages and is easily adaptable to a new language. 
This feature is achieved by a procedure that transforms the grammar of the 
particular programming language into a dialogue representation (see [8]). This approach 
has another advantage as well: the resulting source code contains far fewer syntactic 
errors than code written "by hand". In what follows we describe the dialogue 
generation based on the grammar of the C language. 

2 Grammar Based Dialogue Generation 

We have implemented an automatic tool converting a grammar of the C language into a set 
of dialogues. The input data of the converter is the programming language grammar — 
Table 1. A sample dialogue of code generation (some dialogues are shortened). The resulting code 
is shown in Figure 1. 

The following notation is used in the example: 

P: programmer 
S: AudiC System 
* - earcon 

Dialogue                                                      Comment 

S: main function will be generated 
S: choose return type * int * void * long * other 
P: int 
S: enter parameters 
P:                                                            timeout (user enters nothing for a 
                                                              long time) 
S: enter parameters or use help 
P: <empty>                                                    user enters empty string (no 
                                                              parameters) 
S: generating left brace 
S: choose one * labeled statement * expression statement * 
   compound statement * selection statement * iteration 
   statement * jump statement * function 
P: 4                                                          user chooses the fourth item, 
                                                              "selection statement" 
S: choose one * IF ( expression ) statement * IF ( expression ) 
   statement ELSE statement * SWITCH ( expression ) statement 
P: 1 
S: generating IF, generating left parenthesis 
S: enter expression 
P: help 
S: Expression is . . .                                        short description of the expression 
P: more                                                       user asks for more detailed help; in 
                                                              this case, help (in three levels of 
                                                              detail) can also be invoked via 
                                                              function keys F1, F2, F3 
S: Expression is . . .                                        a more detailed help 
P: <Esc key>                                                  user leaves help 
S: enter expression 
P: x^Ox 
S: * Incorrect syntax, enter again                            simple syntax check; in more difficult 
                                                              cases, the error type and position 
                                                              can be specified 
P: x^O 
S: generating right parenthesis 
S: choose one . . .                                           the dialogue continues with the 
                                                              generation of an arbitrary statement 
for AudiC it is the C or C++ grammar. The converter from grammar to dialogues can be used 
off-line (dialogue descriptions are created in advance and then interpreted at runtime) or 
on-line (dialogue descriptions are created and interpreted at runtime). We have chosen 
the VoiceXML language for describing the dialogues as well as the dialogue flow. VoiceXML 
is an open, broadly supported markup language for speech-based applications (see [5]). 
In what follows we describe the dialogue generation process and present a few examples 
of the resulting dialogues. 



2.1 Off-Line Dialogue Generation 

The C grammar is basically a context-free grammar whose right-hand sides 
contain symbols of three types: non-terminal grammar symbols (in the examples in 
lowercase), optional non-terminal grammar symbols (mixed case) and terminal grammar 
symbols (uppercase). The grammar rules for generating the iteration statement form a 
typical example: 

iteration-statement → WHILE ( expression ) statement 

iteration-statement → DO statement WHILE ( expression ) ; 

iteration-statement → FOR ( Expression ; Expression ; Expression ) statement 

The converter translates such rules into a file in the VoiceXML format describing the 
dialogue flow. The dialogue description defines the system messages, the responses to user 
commands, and some special commands (built-in functions) that generate the source 
code. During interpretation, every dialogue will (besides communicating with 
the user) generate one or more nodes in the syntax tree of the generated program. A 
type is assigned to every node (see Section 4). We distinguish four cases: 

1. There is only one terminal symbol on the right hand side of the rule; the corresponding 
node has type leaf. The converter generates a simple dialogue consisting of 
the built-in function call only. This built-in function generates the node in the source 
code tree with the corresponding constant content (e.g. bracket or semicolon) or 
the collected user input (e.g. identifier or expression). 

2. There are several grammar rules with the same left side, all of them containing only 
one non-terminal symbol on their right hand sides. The converter generates a menu-like 
dialogue whose items are connected to the dialogues of the respective non-terminals; 
the node has type node (see Example 1). During interpretation, the user can choose 
among the possibilities represented by the non-terminals. 

3. There are several grammar rules with the same left side, all of them containing a 
list of terminal and non-terminal symbols on their right hand sides. The converter 
generates one dialogue for each rule and one dialogue with menu items connected 
to those dialogues. The dialogues corresponding to the particular rules demand no 
user feedback and only generate the tree structure. They are "hidden" from the user; 
these nodes have type hidden node (see Example 2). 

4. There is only one non-terminal symbol on the right hand side. The converter generates 
only one dialogue, consisting of a built-in function generating the corresponding 
node and a jump statement redirecting the interpretation to the dialogue corresponding 
to the non-terminal appearing on the right hand side; this node has type 
branch. The rule expression-statement → expression is a typical example. 
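
The four cases can be summarised as a small classification routine. The following Python sketch is only an illustration of the case analysis, not the AudiC converter. The convention that any symbol containing a lowercase letter is a (possibly optional) non-terminal follows the notation of this section; treating a single rule with a symbol list like case 3 is an assumption.

```python
def is_terminal(symbol):
    # Terminals are written in uppercase or are punctuation; non-terminals
    # (lowercase) and optional non-terminals (mixed case) contain a lowercase letter.
    return not any(ch.islower() for ch in symbol)

def node_type(rules):
    """Classify the rules sharing one left side; each rule is a list of symbols."""
    if len(rules) == 1 and len(rules[0]) == 1:
        # Case 1 (single terminal) or case 4 (single non-terminal).
        return "leaf" if is_terminal(rules[0][0]) else "branch"
    if all(len(rhs) == 1 and not is_terminal(rhs[0]) for rhs in rules):
        # Case 2: a menu over alternative non-terminals.
        return "node"
    # Case 3: one hidden dialogue per rule plus a menu dialogue (assumed default).
    return "hidden node"

STATEMENT = [["labeled-statement"], ["expression-statement"], ["jump-statement"]]
ITERATION = [
    ["WHILE", "(", "expression", ")", "statement"],
    ["DO", "statement", "WHILE", "(", "expression", ")", ";"],
]

print(node_type(STATEMENT))
print(node_type(ITERATION))
print(node_type([["expression"]]))
```

In a real converter each returned type would select the dialogue template to emit.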

Example 1. Simple non-terminals. The rules 

statement → labeled-statement 
statement → expression-statement 
statement → compound-statement 
statement → selection-statement 
statement → iteration-statement 
statement → jump-statement 

are transformed into the following dialogue description 

<form id="Statement"> 
  <object classid="builtin://shared.dll#setNodeType"> 
    <param name="type" value="node"/> 
  </object> 
  <field name="SUB1"> 
    <prompt> Choose one: <enumerate/> </prompt> 
    <option dtmf="1" value="#labeled-statement"> labeled statement </option> 
    <option dtmf="2" value="#expression-statement"> expression statement </option> 
    <option dtmf="3" value="#compound-statement"> compound statement </option> 
    <option dtmf="4" value="#selection-statement"> selection statement </option> 
    <option dtmf="5" value="#iteration-statement"> iteration statement </option> 
    <option dtmf="6" value="#jump-statement"> jump statement </option> 
  </field> 
  <help><goto next="help.VXML#statement"/></help> 
  <object classid="builtin://shared.dll#makeNode"> 
    <param name="co" expr="'cmaingrammar.vxml' + SUB1"/> 
  </object> 
  <block name="OUTRO"> <goto next="#go"/> </block> 
</form> 



Example 2. Lists of terminal and non-terminal symbols (the corresponding rules are the 
iteration-statement rules given in Section 2.1). We can create dialogues for the hidden nodes: 

<form id="WHILE_(_expression_)_statement"> 
  <object classid="builtin://shared.dll#setNodeType"> 
    <param name="type" value="hiddennode"/> 
  </object> 
  <object classid="builtin://shared.dll#makeTypedNode"> 
    <param name="co" value="WHILE"/> 
    <param name="type" value="leaf"/> 
    <!-- Czech prompt: "Generating WHILE" --> 
    <prompt> Generuji WHILE </prompt> 
  </object> 
  <object classid="builtin://shared.dll#makeTypedNode"> 
    <param name="co" value="("/> 
    <param name="type" value="leaf"/> 
    <!-- Czech prompt: "Generating left parenthesis" --> 
    <prompt> Generuji levou závorku </prompt> 
  </object> 
  <object classid="builtin://shared.dll#makeNode"> 
    <!-- '#výraz' is the dialogue for "expression" --> 
    <param name="co" expr="'cmaingrammar.vxml' + '#výraz'"/> 
  </object> 
  <object classid="builtin://shared.dll#makeTypedNode"> 
    <param name="co" value=")"/> 
    <param name="type" value="leaf"/> 
    <!-- Czech prompt: "Generating right parenthesis" --> 
    <prompt> Generuji pravou závorku </prompt> 
  </object> 
  <object classid="builtin://shared.dll#makeNode"> 
    <!-- '#příkaz' is the dialogue for "statement" --> 
    <param name="co" expr="'cmaingrammar.vxml' + '#příkaz'"/> 
  </object> 
  <block name="GO"> <goto next="#go"/> </block> 
</form> 



and a menu dialogue similar to the previous case (type node): 

<form id="iteration_statement"> 
  <object classid="builtin://shared.dll#setNodeType"> 
    <param name="type" value="node"/> 
  </object> 
  <field name="SUB1"> 
    <prompt> Choose one: <enumerate/> </prompt> 
    <option dtmf="1" value="#WHILE_(_expression_)_statement"> 
      WHILE ( expression ) statement </option> 
    <option dtmf="2" value="#DO_statement_WHILE_(_expression_)_;"> 
      DO statement WHILE ( expression ) ; </option> 
    <option dtmf="3" value="#FOR_(_Expression_;_Expression_;_Expression_)_statement"> 
      FOR ( expression ; expression ; expression ) statement </option> 
  </field> 
  <help><goto next="help.VXML#iteration_statement"/></help> 
  <object classid="builtin://shared.dll#makeNode"> 
    <param name="co" expr="'cmaingrammar.vxml' + SUB1"/> 
  </object> 
  <block name="OUTRO"> <goto next="#go"/> </block> 
</form> 



The dialogue flow starts at the dialogue corresponding to the compound statement that 
forms the body of the main() function. We have also "cut" the grammar at some points and 
created some new terminals (e.g. the expression statement, because the original grammar 
does not allow expressions to be constructed comfortably). 



2.2 On-Line Dialogue Generation 

Some dialogues in our system are generated during runtime. The dialogue description 
is also dynamically updated according to the generated source code. 

We can demonstrate this feature on C or C++ functions. When the user wants to add a 
function call to the source code, he is presented with a list of all available functions, both 
library functions and user-defined functions. However, every time the user declares a new 
function, the set of user-defined functions changes. The same holds for library 
functions: starting with blank source code, no library functions are available to the 
user, but when the user adds an #include directive in the definition block, all functions 
from the corresponding libraries become available. The system generates a 
new list of available functions after each relevant operation and dynamically updates the 
dialogue(s) presenting the list of functions to the user. 

These dialogues are optimized for blind user access. Each dialogue has a maximum 
of 10 items, which are accessed via the numeric keypad. When the number of available 
functions exceeds 10, the system moves 2 items into a new dialogue and adds the item 
"Next" to the original dialogue. The functions are also re-sorted by name. 
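
One way to read the paging rule is sketched below in Python. This is an interpretation rather than the AudiC code: it assumes that a full menu keeps nine functions plus the "Next" item, so that with eleven functions exactly two move to the new dialogue. The function name paginate is hypothetical.

```python
def paginate(function_names, max_items=10):
    """Split a sorted function list into menus of at most max_items entries."""
    items = sorted(function_names)          # functions are re-sorted by name
    pages = []
    while len(items) > max_items:
        # Keep max_items - 1 functions and reserve the last slot for "Next".
        pages.append(items[:max_items - 1] + ["Next"])
        items = items[max_items - 1:]
    pages.append(items)
    return pages

ELEVEN = [f"func{i:02d}" for i in range(11)]
PAGES = paginate(ELEVEN)
for page in PAGES:
    print(page)
```

With ten or fewer functions a single menu is produced and no "Next" item is needed.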

Grammar shortcuts are another dynamic feature. Imagine, for example, 
that the programmer often uses the for cycle. He can add a shortcut to this statement to 
the topmost level of statements; the for cycle then appears at the same level (in the same 
dialogue) as the expression statement or the compound statement. 

3 Source Code Generation 

Dialogue definitions described in the previous section contain all the information needed 
for source code generation. The system loads the dialogue description in VoiceXML and 
interprets it. In addition to the utterances presented to the user, the description also 
contains calls of some built-in functions. These functions generate the source code. A typical 
call of a built-in function looks like the following example: 

<object classid="builtin://shared.dll#makeNode"> 
  <param name="co" expr="SUB1"/> 
</object> 

The source code is represented as a tree structure. Each node is labeled by the 
corresponding terminal or non-terminal symbol of the grammar and a type (see Figure 1). 
The type is used for traversing and editing the tree (see Section 4). 
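
A tree of labelled, typed nodes can be sketched as follows. This is an illustrative Python data structure, not the AudiC internals: the class name Node, the text field for leaf content and the render function are assumptions introduced for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str        # terminal or non-terminal symbol of the grammar
    type: str         # "leaf", "node", "hidden node" or "branch"
    text: str = ""    # source text, only meaningful for leaves
    children: list = field(default_factory=list)

def render(node):
    """Concatenate the text of all leaves in left-to-right order."""
    if node.type == "leaf":
        return node.text
    parts = (render(child) for child in node.children)
    return " ".join(part for part in parts if part)

# A hypothetical tree for the statement `if ( x > 0 ) ;`.
TREE = Node("selection-statement", "node", children=[
    Node("IF ( expression ) statement", "hidden node", children=[
        Node("IF", "leaf", "if"),
        Node("(", "leaf", "("),
        Node("expression", "leaf", "x > 0"),
        Node(")", "leaf", ")"),
        Node("statement", "node", children=[Node(";", "leaf", ";")]),
    ]),
])

print(render(TREE))
```

Keeping the grammar symbol and the type on every node is what later allows editing operations to be restricted to syntactically safe positions.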




Fig. 1. Screenshot of the AudiC system. There is a tree view of the generated source code at the 
left side of the picture. The type of the node is represented by an icon. The code in its textual form 
is in the upper right corner of the picture. 
4 Source Code Editing 

Source code generation is only one of the tasks of an integrated development environment. 
Another important stage is source code editing. AudiC strongly supports the generation 
of syntactically correct code. The system preserves this correctness in the editing 
stage as well. For this purpose, the syntactic information present in the generated tree 
structure is used. Each node is assigned a label and a type. These attributes define the 
set of actions (again performed by dialogues) allowed at this node. Let us briefly 
summarize the most important actions: 

- Reading. Reads the content of the node or possibly of the subtree under the node. 
Several reading modes are available, differing in the amount of information conveyed 
to the user and in the form of the utterance (synthesized speech, sampled speech, 
environmental sounds or some combination of these). 

- Creating remarks. Adds a remark to the node. The remark can be either written 
or spoken. A written remark is added to the resulting source code as a standard 
C comment. A spoken remark is replayed when the node content is read. 

- Editing. Changes the content of the node. The action Edit is allowed only on 
nodes whose labels appear on the left side of some rule of the grammar (type 
node). The system allows the user to choose among all grammar rules whose left 
side equals the label of that node. 

- Deleting. Deletes the subtree of the current node. Delete is similar to Edit; it is again 
allowed only on nodes of type node. 
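
The dependence of the permitted actions on the node type can be sketched in a few lines of Python. The text only states that Edit and Delete are restricted to nodes of type node; that Reading and remarks are available on every node is an assumption, and the function names and the toy grammar are hypothetical.

```python
def allowed_actions(node_type):
    """Return the set of actions permitted at a node of the given type."""
    actions = {"read", "remark"}        # assumed to be available everywhere
    if node_type == "node":
        actions |= {"edit", "delete"}   # restricted to type node by the text
    return actions

def edit_choices(label, grammar):
    """List the right-hand sides the user may choose from when editing a node."""
    return [rhs for lhs, rhs in grammar if lhs == label]

# A toy grammar given as (left side, right side) pairs.
GRAMMAR = [
    ("statement", "expression-statement"),
    ("statement", "iteration-statement"),
    ("expression-statement", "expression"),
]

print(sorted(allowed_actions("node")))
print(sorted(allowed_actions("leaf")))
print(edit_choices("statement", GRAMMAR))
```

Because an edit can only substitute another rule with the same left side, the tree stays syntactically correct after the change.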

The system allows the user to work with the source code at several positions of the syntax 
tree at a time. A node of the tree, together with the name of the dialogue responsible for 
generating or updating it, forms a structure called a context. The user can switch among 
contexts quickly and comfortably. The old context is suspended and a new context is loaded 
into the VoiceXML interpreter at the user's request. There are three standard contexts 
in the system, but the user can add (and remove) his own contexts. The three 
standard contexts are: 

1. Global Declarations — generating global variables. 

2. Functions — generating user functions including the main function. 

3. Local Declarations — generating variables of the currently defined function. 

5 Conclusions 

The AudiC system is a prototype of a dialogue-based programming environment. Its 
design and properties show that it may become a useful tool for blind programmers. 
Its ability to generate code with fewer syntactic errors than common environments 
may also make it useful for learning a particular programming language. 

In the future, we expect to extend the functionality of the system by employing user 
modeling or some source code analysis. 
Abstract. In this paper we present an algorithm for the (semi-)automatic identification 
of anaphors whose antecedents are verbal phrases, clauses or discourse 
segments in Danish dialogues. Although these anaphors are quite frequent, especially 
in conversations, they have usually been neglected in computational linguistics. 
The algorithm we propose contains defeasible rules for distinguishing these 
anaphors from those which have individual nominals as antecedents. The rules have 
been identified by looking at the occurrences of these types of anaphor in the transcriptions 
of two dialogue collections. The algorithm has been manually tested on 
four Danish dialogues and the obtained results have been evaluated.* 



1 Introduction 

This paper deals with the identification of Danish anaphors whose antecedents are ver- 
bal phrases, clauses and discourse segments. Following [18], we call them discourse 
deictics.^ Two examples of Danish discourse deictics are given in (1). 

(1) og så prøvede jeg så at gå lidt i svømmehallen (1 sek) og det prøver jeg sådan ind 
imellem, men jeg hader det 

(lit. and then tried I then to go a little to the swimming pool (1 sec.) and it try 
I such from time to time, but I hate it) 

(and then I tried to go a little to the swimming pool (1 sec.) and I still try from 
time to time, but I hate it) 

The two occurrences of the pronoun det (it/this/that) in (1) refer to the infinitive at 
gå i svømmehallen (to go to the swimming pool). 

Although discourse deictics occur very frequently, especially in spoken language, 
they are seldom dealt with in the literature, because their treatment is quite problematic. 
First of all it is difficult to distinguish them from pronouns with non-abstract antecedents 
because the same pronouns are used to refer to both abstract and non-abstract entities. It 
is also hard to recognise the correct antecedent, i.e. verbal phrases, clauses or discourse 

* The research described has been done under the Staging project which is funded by the Danish 
Research Councils. 

^ Anaphors with nominal antecedents having an abstract referent can be included in the group of 
discourse deictics, but we have not looked at them in the present work. 
segments. Finally, the semantic object pointed to by the deictic must be identified; see, among 
others, [13,18]. Despite these difficulties, it is important to identify discourse deictics because 
they cannot be treated as anaphors referring to non-abstract objects. This paper deals 
with this aspect. 

We have based our study of discourse deictics on their occurrences in the transcriptions 
of two Danish dialogue collections, Samtale hos Lægen ("The Talk at the 
Doctor's"), henceforth SL, [4,12] and BySoc [9,11]. The conversations have been collected 
by researchers at the University of Copenhagen and contain approx. 89,000 and 
one million running words, respectively. We have supplemented our research by looking at the 
occurrences of discourse deictics in the written text corpus Bergenholtz [2], containing 
approx. five million words. 

In section 2 we present the Danish data. In section 3 we discuss the background for 
our work and we propose preference rules for identifying Danish discourse deictics. In 
section 4 we evaluate these rules while in section 5 we make some concluding remarks. 



2 Danish Discourse Deictics 

Discourse deictics in Danish comprise the following third-person neuter gender pronouns 
and demonstratives: det (it, this, that), dette (this), det her (this) and det der (that). The 
most common discourse deictic is det, while dette is mostly used in written language. 
We only found one occurrence of it in our dialogue collections. 

Examples of Danish discourse deictics are the following: 

- discourse deictic corefers with a clause: 

(2) A: Du skal tage en blodprøve 
(You have to take a blood test) 

B: Hvorfor det? 

(Why is that?) 

- discourse deictic is used as the subject complement of være (be) and blive (become) 
in answers (or in coordinated successive clauses): 

(3) a. A: Blev du færdig med opgaven? 

(Did you finish the task?) 

B: Ja, det blev jeg 
(lit. Yes, that did I) 

(Yes, I did) 
b. A: Er du syg? 

(Are you ill?) 

B: det er jeg 
(lit. that am I) 

(Yes, I am) 

- discourse deictic corefers with a verb phrase when it is used as the object complement 
of the verb have (have), modal verbs and with the verb gøre (do), which replaces 
the lexical verb in the previous clause in cases where the finite verb of the clause is 
not an auxiliary or a modal: 




C. Navarretta 



(4) a. A: har de set lejligheden? 

(have they seen the apartment?) 

B: det har de 
(lit. that have they) 

(Yes, they have) 

b. A: Skal du også læse filosofi? 

(Are you also going to study philosophy?) 

B: Nej, det skal jeg ikke 
(lit. No, that am I not) 

(No, I am not) 

c. Jeg faldt, men det gjorde hun ikke 
(lit. I fell, but that did she not) 

(I fell, but she did not) 

- discourse deictic co-refers with an infinitive clause: 

(5) At ryge er farligt og det er også dyrt 
(Smoking is dangerous and it is also expensive) 

- discourse deictic corefers with a clause in constructions with attitude verbs and other 
verbs which take clausal complements, such as tro (believe), vide (know), sige (say) 
and pr0ve (try): 

(6) a. A: er du øm her ved livmoderhalsen 

(does it hurt here by your cervix uteri) 

B: nej ... det tror jeg nu ikke 
(lit. no ... that think I not) 

(no.. I don’t think so) 

SL 

b. A: du kan ligeså godt gå fra så tidligt som muligt selvfølgelig 
(you can just as well go on leave as early as possible of course) 

B: ja, det synes jeg 
(lit. yes that think I) 

(yes, I think so) 

SL 

- discourse deictic refers to more clauses, or to something that can be vaguely inferred 
from the previous discourse (vague anaphora) as it is the case in the following 
example: 

(7) A: nu skal vi jo have ?(lille)? drengen til... i skole her til august jo 
(now we must have the ?(little)? boy in school here in august) 

B: ja 

(yes) 

A: så skal han starte på den der Kreb- eller (ler) 

(then he has to begin in that Kreb- or) (laughs)^ 

B: skal han det 
(lit. has he that) 

(has he?) 

A: ja... han skal. det vil jeg sgu’ godt give ham 
(lit. yes... he has, that will I certainly give him) 

(yes... he has, I will certainly give it to him) 

SL 

^ Here it is referred to Krebsskolen, a private school in Copenhagen. 
In example (7) det refers to the fact that speaker A wants to pay the school fee for his 
child and allow him to attend a renowned private school. These facts are not explicitly 
stated in the conversation. 

As in English, Danish discourse deictics can refer to one or more verbal phrases, one 
or more clauses, a discourse segment, or something that can be vaguely inferred from 
the context. Furthermore, Danish deictics are used in cases where elliptical constructions 
are common in English, and instead of "do so/do too" constructions. A characteristic of 
Danish discourse deictics is that they often appear before the main verb, in the place that 
is usually occupied by the subject, as can be seen in examples (2)-(7). This position 
is called "fundamentfelt" (actualisation field) in [3]. 



3 Identifying Discourse Deictics 

Discourse deictics are even more common in Danish than in English, especially in 
dialogues. For instance, annotating the pronominal anaphors in four dialogues from the 
SL collection, we found that 216 out of 395 personal and demonstrative pronouns were 
discourse deictics. Although discourse deictics are so common, only one algorithm has 
been proposed for resolving them, the ES99-algorithm [6,5]. Eckert and Strube, ES99 
henceforth, define the ES99-algorithm for resolving anaphors referring to individual 
nominals and abstract objects in English telephone conversations. The algorithm contains 
rules for discriminating between the two types of anaphor based on the predicative contexts 
in which the anaphors occur. Anaphors classified as referring to non-abstract objects 
are resolved with a centering-based algorithm [17]. Anaphors recognised as discourse 
deictics are divided into different types and some of them are then resolved with a specific 
algorithm. ES99 manually test the approach on selected dialogues and obtain a precision 
of 63.6% for discourse deictics and of 66.2% for individual anaphors. The precision for 
individual anaphors is much lower than that obtained when centering-based resolution 
algorithms are only applied to anaphors with non-abstract antecedents. 

The ES99-algorithm was adapted to Danish with slightly better results than those
obtained by ES99, but it was found too simplistic for correctly classifying and resolving
different types of discourse deixis [16,15]. Although we agree, we believe that the
ES99-strategy of identifying discourse deictics from their contexts is useful to NLP systems
and that this part of their algorithm is worth pursuing. The strategy is also in line with
the studies of English discourse deictics in [8,1]. Thus we have decided to investigate
the contexts in which Danish discourse deictics occur and extend the original ES99 rules
with both general and Danish-specific rules. Most of the rules are preference rules, and thus
defeasible. Of the rules we present in the following, the first four are simply adaptations
to Danish of the ES99-rules and are marked with an asterisk (*). The remaining rules
have been identified by us.

Rules for identifying Danish discourse deictics: 

1. * constructions where a pronoun is equated with an abstract object, e.g., x er et
forslag (x is a suggestion)

2. * copula constructions with adjectives which can only be applied to abstract entities,
such as x er sandt (x is true), x er usandt (x is untrue), x er rigtigt (x is correct)

3. * arguments of verbs which take S'-complements, e.g., tro (believe), antage (assume),
mene (think), sige (say)

4. * anaphoric referent in constructions such as x er fordi du er holdt op med at ryge (x
is because you have stopped smoking), x er på grund af at du er gravid (x is because
you are pregnant)

5. object of gøre (do)

6. subject complement with være (be) and blive (become) in answers or in coordinated
clauses

7. object of have (have) if the verb was not used as a main verb in the previous clause

8. object of modal verbs

9. in copula constructions where the adjective can refer both to an individual NP and
to an abstract object, such as x er godt (x is good), x er dårligt (x is bad), the anaphor
co-refers with an abstract object if the previous clause contains a raising adjective
construction or a construction where an infinitive clause is the subject

10. in constructions where the anaphors are objects of verbs such as elske (love), hade
(hate), foretrække (prefer), the anaphor co-refers with an abstract object if the previous
clause contains a raising adjective construction or a construction where an
infinitive clause is the subject (see rule 9)

11. in constructions of the type det lyder godt (it sounds good), det lyder dårligt (it
sounds bad), det co-refers with a discourse segment unless the previous utterance/
clause contains a nominal or a verb referring to sounds

Rules 9-11 deal with pronouns that can both have an abstract and a non-abstract 
referent. Rule 10 is illustrated by the following two examples: 

(8) a. Peter ejede det store røde hus ved købmandsbutikken. Det hadede han.
       (lit. Peter owned the big red house near the grocer’s store. It hated he.)
       (Peter owned the big red house near the grocer’s store. He hated it.)
    b. Det er dødsygt at sidde på et vaskeri. Det hader jeg.
       (lit. It is boring to be in a laundry. It hate I.)
       (It is boring to be in a laundry. I hate it.)

In example (8-a) the algorithm chooses det store røde hus ved købmandsbutikken
(the big red house near the grocer’s store) as the antecedent of det, while in example (8-b)
it chooses at sidde på et vaskeri (being in a laundry) instead of et vaskeri.

It must be noted that in cases such as example (8-a), it is often not possible to determine
whether the anaphor refers to an individual NP or an abstract object without a deeper
analysis of the discourse. Obviously our simple rules will fail to detect these ambiguities.
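
The context-based rules above lend themselves to a simple lookup-style implementation. The following Python fragment is only a minimal sketch of rules 1, 2, 3 and 5; the function name, the argument names and the one-word toy lexicon for abstract nouns are assumptions made for illustration, and the fragment is not the procedure used in the evaluation reported in this paper.

```python
# Hypothetical sketch covering rules 1, 2, 3 and 5 only; the word lists are
# toy examples taken from the rule descriptions, not a real lexicon.
ABSTRACT_NOUNS = {"forslag"}                              # rule 1
ABSTRACT_ADJECTIVES = {"sandt", "usandt", "rigtigt"}      # rule 2
S_COMPLEMENT_VERBS = {"tro", "antage", "mene", "sige"}    # rule 3
COPULA_LEMMA = "være"


def is_discourse_deictic(governor, role, complement=None):
    """Return True if the context marks the pronoun as a discourse deictic.

    governor:   lemma of the verb governing the pronoun
    role:       "subject" or "object"
    complement: head word of the predicate complement, if any
    """
    if role == "subject" and governor == COPULA_LEMMA:
        # Rules 1 and 2: x er et forslag / x er sandt
        return complement in ABSTRACT_NOUNS or complement in ABSTRACT_ADJECTIVES
    if role == "object":
        # Rule 3 (verbs taking S'-complements) and rule 5 (object of gøre)
        return governor in S_COMPLEMENT_VERBS or governor == "gøre"
    return False
```

Rules 9-11 would additionally need access to the previous clause, which is why they are left out of this sketch.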



4 Evaluation 

To test the rules we have randomly chosen four dialogues from the SL-collection, and
manually marked all the occurrences of third person singular neuter personal and
demonstrative pronouns as individual anaphors or discourse deictics. Pleonastic uses of det (it)
have also been marked as non-anaphoric and have been excluded from the test. The rules
for identifying discourse deictics have been manually applied to the unmarked dialogues.
Lines in the dialogues containing the constructions indicated by the rules have been
automatically extracted from the tagged dialogues, and the results have been manually
checked. Verbs taking S'-complements and raising adjectives have been automatically
marked using the syntactic encodings of the Danish PAROLE lexicon. The results of
the human disambiguation and the rule-based identification have then been compared.
The success rate of the discriminating algorithm was 86.13%. Cases of failure were
mainly anaphors occurring in constructions that allow both an individual NP
antecedent and an abstract object antecedent and that are not covered by rules 9, 10 and
11. Examples are objects of verbs which usually take a concrete object, but are used
metaphorically, such as sluge (swallow).

One problem with the test we made is that we applied the algorithm on the same 
type of dialogue which we used to identify the algorithm’s rules. Although we have also 
looked at discourse deictics in a written corpus to identify the rules, it is possible that 
there are cases of identifiable deictics which we have not covered. 

5 Conclusion and Future Work 

In the paper we have proposed rules for the (semi-)automatic identification of Danish 
discourse deictics on the basis of the contexts they occur in. The idea is taken from 
[6,5]. The first test of these rules gave good results, but it was made on a subset of the 
dialogues used to identify the rules. Thus they should be tested on other types of dialogue 
and on written texts. The discriminating rules should also be supplied with a semantic 
lexicon containing information about metaphorical uses of verbs and nominals referring 
to abstract objects. Although we believe that the results of the algorithm would be 
improved by such a lexicon, it is impossible, in our opinion, to discriminate all cases 
of anaphors which can both refer to non-abstract and abstract objects without a deep 
analysis of the context. 

In this paper we have not addressed the issue of how to resolve the identified
discourse deictics. However, looking at the contexts in which the anaphors occur also
helps to identify the type of semantic object referred to by the anaphors [18,8], and we
will investigate this aspect in our future work.

Abstract. Over the last few years, stochastic models have been widely used in
natural language understanding modeling. Almost all of these works are based
on the definition of segments of words as basic semantic units for the stochastic
semantic models.

In this work, we present a two-level stochastic model approach to the construction 
of the natural language understanding component of a dialog system in the domain 
of database queries. This approach will treat this problem in a way similar to the 
stochastic approach for the detection of syntactic structures (Shallow Parsing or 
Chunking) in natural language sentences; however, in this case, stochastic semantic 
language models are based on the detection of some semantic units from the user 
turns of the dialog. We give the results of the application of this approach to the 
construction of the understanding component of a dialog system, which answers 
queries about a railway timetable in Spanish. 



1 Introduction 

Language Understanding systems have many applications in several areas of Natural 
Language Processing. Typical applications are train or plane travel information retrieval, 
car navigation systems or information desks. In the last few years, many efforts have 
been made in the development of natural language dialog systems which allow us to 
extract information from databases. The interaction with the machine to obtain this
kind of information requires some dialog turns. In these turns, the user and the system
exchange information in order to achieve the objective: the answer to a query made
by the user. Each turn (a sequence of natural language sentences) of the user must
be understood by the system. Therefore, an acceptable behavior of the understanding 
component of the system is essential to the correct performance of the whole dialog 
system. 

There are different ways to represent the meaning of natural language sentences in 
the domain of database queries. One of the most usual representations in tasks of this 
kind is based on frames. In a frame, we can represent a user turn of the dialog as a concept 
(or a list of concepts) and a list of constraints made over this concept. 

Many works in the literature use rule-based techniques in order to obtain the translation
of a user turn into the corresponding frame. They are mainly based on the detection
of keywords, which characterize the semantic constituents of the sentences. Other
approaches are based on statistical modeling. Over the last few years, stochastic models,
which estimate the models automatically from data, have been widely used in natural
language understanding modeling [6] [12] [7] [14].

Almost all of these works are based on the definition of segments of words as basic
semantic units for the stochastic semantic models. In most of them, the definition of
classes of words is necessary in order to obtain high coverage models from the given data.
(The problem of the lack of training data is always present when automatic learning
techniques are used.) This approach to the natural language understanding problem presents
a strong parallelism with the stochastic approach applied in recent years [2] [8] [9] to
the problem of tagging texts, when the objective is not only to associate POS tags to
words but to detect some syntactic structures such as NP, VP, PP, etc. In the first case,
the segments represent semantic units and, in the second one, they represent syntactic
units.

In this work, we present a two-level stochastic model approach to the construction 
of the natural language understanding component of a dialog system in the domain of 
database queries. This approach will treat this problem in a way similar to the stochastic 
approach for the detection of syntactic structures (Shallow Parsing or Chunking) in 
natural language sentences. However, in this case, stochastic semantic language models 
are based on the detection of some semantic units from the user turns of the dialog. 
We describe the application of this approach to the construction of the understanding
component of a dialog system, which answers queries about a railway timetable in Spanish.



2 System Overview 

The dialog system has been developed in the BASURDE project [1], and it follows 
a classic modular architecture. The input of the understanding module is a sequence 
of words, and the output is the semantic representation of the sentence (one or several 
frames). This output constitutes the input to the dialog manager, which will generate the 
corresponding answers following the dialog strategy. The knowledge sources used by 
the understanding module are the syntactic and semantic models, and, optionally, the 
dialog act predicted by the dialog manager. 

The semantics of input sentences is represented by frames [13], as in other dialog
systems [5]. Each type of frame has a list of cases associated to it. These cases, which
will be filled through the understanding process, are the constraints given by the user. We
consider two types of frames: task-dependent, if the sentence is a complete or incomplete
query, or a confirmation of some information, and task-independent, if the sentence is
an affirmation, negation, courtesy, etc.

Each sentence can be represented by one or more frames. In the railway timetable
task, a list of 11 types of frames and 22 cases were defined. An example of an input
sentence and the corresponding frames are:

INPUT SENTENCE: Yes. I would like to know the price and the type
                of the train that leaves at 23 hours and 5 minutes.

OUTPUT FRAMES:
(AFFIRMATION)
(PRICE)
    DEPARTURE-TIME: 23:05
(TYPE)
    DEPARTURE-TIME: 23:05
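
As an illustration of the frame representation, the output for the example sentence about the 23:05 train can be held in a small data structure. The Python fragment below is a minimal sketch; the class name `Frame`, the field names and the `render` helper are assumptions introduced here and are not part of the BASURDE system.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame type plus the cases (constraints) filled in for it."""
    name: str
    cases: dict = field(default_factory=dict)


# Frames for: "Yes. I would like to know the price and the type
# of the train that leaves at 23 hours and 5 minutes."
frames = [
    Frame("AFFIRMATION"),                          # task-independent frame
    Frame("PRICE", {"DEPARTURE-TIME": "23:05"}),   # task-dependent frames
    Frame("TYPE", {"DEPARTURE-TIME": "23:05"}),
]


def render(frame):
    """Format a frame in the textual layout used in the example."""
    lines = [f"({frame.name})"]
    lines += [f"    {case}: {value}" for case, value in frame.cases.items()]
    return "\n".join(lines)
```

Keeping the cases in a mapping makes it straightforward for a later component to merge constraints that arrive in different user turns.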

The understanding process consists of two phases (see Figure 1). In the first phase, 
the input sentence is sequentially transduced into a sentence of an intermediate semantic 
language [11], and, in the second phase, the frame or frames associated to this sentence 
are generated. The models used for the sequential transduction are stochastic models, 
which are automatically learnt from training samples. Some rules are applied in the 
second phase in order to obtain the corresponding frames from the semantic sentence. 



[Figure 1: Input sentence -> Orthographic/Semantic Decoding -> Sequence of pairs:
Segment/Semantic unit -> Frame Generation -> Frame]

Fig. 1. General Description of the Understanding Process



The intermediate semantic language is defined over a vocabulary of semantic units. 
Each semantic unit stands for a specific meaning. Due to the fact that the defined semantic 
language is sequential with the input language, we can perform a segmentation of the 
input sentence into a number of intervals which is equal to the number of semantic 
units in the corresponding semantic sentence. An example of segmentation of an input 
sentence is: 



Input sentence:

me podría decir los horarios de trenes para Barcelona
(can you tell me the railway timetable to Barcelona)

Spanish

u1 = me podría decir          v1 = consulta
u2 = los horarios de trenes   v2 = <hora_salida>
u3 = para                     v3 = marcador_destino
u4 = Barcelona                v4 = ciudad_destino

English

u1 = can you tell me          v1 = query
u2 = the railway timetable    v2 = <departure_time>
u3 = to                       v3 = destination_marker
u4 = Barcelona                v4 = destination_city



3 General Description of the Two-Level Stochastic Models 

In order to implement the first phase, we propose an approach based on a two-level 
stochastic model. This model combines different knowledge sources at two different 
levels. The top level models the intermediate semantic language, that is, the set of 
sequences of semantic units. The lower level represents the internal structure of each 
semantic unit in terms of the linguistic units considered (words, POS, lemmas, etc.). The 
formalism [9] that we use for the models in the two levels is finite-state automata. To be 
exact, we use models of bigrams which are smoothed using the back-off technique [4] 
in order to achieve full coverage of the language considered. The bigram probabilities 
are obtained by means of the SLM TOOLKIT [3] from the sequences of different units 
in the training set. They are then represented as finite-state automata. 
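
To make the idea of a smoothed bigram model concrete, the following Python fragment estimates bigram probabilities over sequences of units and backs off to unigram probabilities for unseen pairs. It is only a sketch under stated assumptions: it uses absolute discounting with back-off rather than Katz's Good-Turing based scheme [4], it does not use the SLM TOOLKIT [3], and the function name and the `discount` parameter are introduced here for illustration.

```python
from collections import Counter, defaultdict


def train_backoff_bigram(sequences, discount=0.5):
    """Return prob(cur, prev): a discounted bigram estimate that backs off
    to the unigram distribution for pairs not seen in training."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for seq in sequences:
        unigrams.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev][cur] += 1
    total = sum(unigrams.values())
    p_uni = {unit: count / total for unit, count in unigrams.items()}

    def prob(cur, prev):
        followers = bigrams.get(prev)
        if not followers:
            # History never seen with a successor: use the unigram estimate.
            return p_uni.get(cur, 0.0)
        hist_count = sum(followers.values())
        if cur in followers:
            # Seen bigram: subtract a fixed discount from its count.
            return (followers[cur] - discount) / hist_count
        # Unseen bigram: share the reserved mass in proportion to unigrams.
        reserved = discount * len(followers) / hist_count
        unseen_mass = 1.0 - sum(p_uni[unit] for unit in followers)
        if unseen_mass <= 0.0:
            return 0.0
        return reserved * p_uni.get(cur, 0.0) / unseen_mass

    return prob


# Toy training data: two sequences of semantic units.
TRAIN = [
    ["consulta", "<hora_salida>", "marcador_destino", "ciudad_destino"],
    ["consulta", "marcador_destino", "ciudad_destino"],
]
bigram_prob = train_backoff_bigram(TRAIN)
```

Because every unit pair receives a non-zero probability, the resulting model covers sequences that never occurred in the training set, which is the purpose of the smoothing step.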

All these models are estimated from a corpus of sentences which are segmented in
semantic units. From this training set, we learn the following models:

[Figure 2: (a) Semantic Language Model; (b) Model for the Semantic Unit ‘Si’;
(c) Integrated LM]

Fig. 2. Integrated Language Model



- The Semantic Language Model (Figure 2 (a)), that is, a bigram model which represents
the concatenation of semantic units (semantic units are associated to states
in the figure). This model is learnt from the sequences of semantic units associated
to each sentence of the training set. The sequence of semantic units in the above
example is:

consulta <hora_salida> marcador_destino ciudad_destino
(query <departure_time> destination_marker destination_city)

- The Models for the Semantic Units. The structure of each semantic unit can be
established directly in terms of words (word bigram models). This approach produces
models that are very big and very dependent on the vocabulary of the application.
For this reason, we propose an alternative method based on POS tags. To do this, we
use a Spanish POS tagger [9] which supplies the corresponding POS tag for every
word. In this situation, we obtain a new training set annotated with morphological
information. For each semantic unit we learn an HMM in which the states represent
POS tags and the words are emitted from them according to a certain lexical
probability (Figure 2 (b)). This HMM is estimated from the segments of POS associated
to this semantic unit.

Once the different models have been learnt, a regular substitution of the lower models
into the upper one is made. In this way, we get a single integrated model (Figure 2
(c)) which shows the possible concatenations of semantic units and their internal
structure. This integrated model includes the transition probabilities as well as the lexical
probabilities.

Fig. 3. An example of a Lexicalized State



In order to model the semantic units more accurately, we used
a technique to enrich the HMM [10]. This technique consists of incorporating new
categories into the POS tag set. These new categories are strongly related to some selected
words, which can be established empirically from the training set or following other
criteria. We obtain lexicalized models with this process. Although this lexicalization
produces more complex models, the semantic unit models are improved. For instance, if
we lexicalize the prepositions ‘to’ and ‘from’ we can distinguish between two strongly
different meanings in the railway timetable task.

In Figure 3, we show the effect of this lexicalization over a generic state belonging
to a certain syntactic unit, when it is particularized for a certain word w_i. In this way, we
obtain a new state (filled state in the figure) in which only the word w_i can be emitted
with lexical probability equal to 1.

Our system can be considered as a two-level transducer. The upper level describes
contextual information about the structure of the sentences, and the lower level models
the structure of the semantic units considered.

The Understanding process consists of finding out the sequence of states of maximum 
probability on the integrated model for an input sentence. Therefore, this sequence must 
be compatible with the contextual and lexical constraints. This process can be carried 
out by Dynamic Programming using the Viterbi algorithm, which we adapted to our 
models. From the Dynamic Programming trellis, we can obtain the best segmentation 
of the input sentence into semantic units. 
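
The decoding step can be illustrated with a plain Viterbi search. The Python fragment below is a simplified sketch, not the adapted algorithm used in the system: it assumes a flat HMM whose states are semantic units that emit words directly, instead of the integrated two-level model with POS states, and all probabilities in the toy model are invented for illustration. Accents are omitted in the toy vocabulary.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `words` (plain HMM Viterbi)."""
    # trellis[t][s] = (probability of the best path ending in s at time t, backpointer)
    trellis = [{s: (start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0), None)
                for s in states}]
    for word in words[1:]:
        prev_col = trellis[-1]
        column = {}
        for s in states:
            best = max(states, key=lambda r: prev_col[r][0] * trans_p[r].get(s, 0.0))
            score = prev_col[best][0] * trans_p[best].get(s, 0.0) * emit_p[s].get(word, 0.0)
            column[s] = (score, best)
        trellis.append(column)
    # Follow the backpointers from the best final state.
    state = max(states, key=lambda s: trellis[-1][s][0])
    path = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model with made-up probabilities (illustrative only).
STATES = ["consulta", "<hora_salida>", "marcador_destino", "ciudad_destino"]
START = {"consulta": 1.0}
TRANS = {
    "consulta": {"consulta": 0.6, "<hora_salida>": 0.4},
    "<hora_salida>": {"<hora_salida>": 0.7, "marcador_destino": 0.3},
    "marcador_destino": {"ciudad_destino": 1.0},
    "ciudad_destino": {"ciudad_destino": 1.0},
}
EMIT = {
    "consulta": {"me": 0.3, "podria": 0.3, "decir": 0.4},
    "<hora_salida>": {"los": 0.25, "horarios": 0.25, "de": 0.25, "trenes": 0.25},
    "marcador_destino": {"para": 1.0},
    "ciudad_destino": {"Barcelona": 1.0},
}
WORDS = "me podria decir los horarios de trenes para Barcelona".split()
PATH = viterbi(WORDS, STATES, START, TRANS, EMIT)
```

Runs of identical states in `PATH` correspond to the segments of the input sentence. A practical implementation would work with log probabilities to avoid numerical underflow on long utterances.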

4 Experimental Results 

The language understanding model obtained was applied to an understanding task which 
was integrated into a spoken dialog system answering queries about a railway timetable 
in Spanish [1]. 

The corpus consisted of the orthographic transcription of a set of 215 dialogs,
obtained through a Wizard of Oz technique. Using only the user utterances, we defined
a training set of 175 dialogs with 1,141 user utterances, and a test set of 40 dialogs with
268 user utterances. The number of words in these two sets was 11,987 and the average
length of the utterances was 10.5 words.

We used several measures to evaluate the accuracy of the models:

- The percentage of correct sequences of semantic units (%cssu).

- The percentage of correct frames (%cf).

- The precision (%P), that is, the ratio between the number of correct proposed semantic
units and the number of proposed semantic units.

- The recall (%R), that is, the ratio between the number of correct proposed semantic
units and the number of semantic units in the reference.

- The score $F_{\beta=1} = \frac{2 P R}{P + R}$, which combines the last two rates (%P and %R).

We evaluated the segmentation accuracy and the correct interpretation of the user 
utterances using these measures. 
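
The precision, recall and F measures can be computed as in the following short Python sketch. The function names and the count-based interface are assumptions made for illustration; the general F formula with weight beta is the standard definition, which reduces to 2PR/(P+R) for beta = 1.

```python
def precision_recall(correct, proposed, reference):
    """Precision and recall (in percent) from unit counts."""
    precision = 100.0 * correct / proposed if proposed else 0.0
    recall = 100.0 * correct / reference if reference else 0.0
    return precision, recall


def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For instance, the precision and recall reported for the BIG-BIG model in Table 1 (55.9 and 51.0) give `round(f_score(55.9, 51.0), 1) == 53.3`, in agreement with the F value listed there.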



Table 1. Experimental results

Models          %cssu   %cf    %P     %R     F_{β=1}
BIG-BIG         32.3    41.0   55.9   51.0   53.3
BIG-BIG-word    58.7    67.3   78.9   79.2   79.0
BIG-BIG-lemma   59.9    72.5   79.6   81.0   80.3



In Table 1, we show the experimental results with three approaches: a two-level
approach using POS bigram models in the two levels (BIG-BIG), a two-level approach
using POS bigram models which were lexicalized taking into account the most frequent
words (BIG-BIG-word), and a two-level approach using POS bigram models which
were lexicalized taking into account the most frequent lemmas (BIG-BIG-lemma). In
the third case, the lexicalized units were lemmas instead of words. These lemmas were
obtained through a morphological analyzer.

On the one hand, it can be observed from Table 1 that there is a big difference between
the %cssu and the %cf measures. This difference is due to the fact that, in many cases,
although the obtained semantic sentence is not exactly the same as the reference semantic
sentence, their corresponding frame is the same.

On the other hand, we observed that the best performance was achieved using
lexicalized models. That is because the lexicalization process gave a more accurate relation
between words and semantic units. In the BIG-BIG-word model, the best results were
achieved taking into account the words whose frequency in training data was larger
than 9. In the BIG-BIG-lemma model, we considered lemmas instead of words in order
to specialize the model. This kind of specialization slightly improves the performance
of the system.

In [14], some results on the same corpus are presented using word bigram models. 
Although this approach gave a slightly better performance than our approach, models 
based on categories are more independent from the task than models based on words. 

5 Conclusions and Future Work 

We have presented an approach to Natural Language Understanding based on Stochastic 
Models, which are automatically learnt from data. In particular, we have used some 
techniques from the areas of POS tagging and Shallow Parsing. 

The evaluation has been done on a task of language understanding in a Spanish dialog
system which answers queries about a railway timetable. Considering that the available
data was small, the results were relatively reliable.
In any case, from the experimental results, we observed that the best performance was
achieved using lexicalized models. That is because the lexicalization process gives a more
accurate relation between words and semantic units. We hope that a more appropriate
definition of the list of words/lemmas to be lexicalized will provide better performance.

On the other hand, the incorporation of contextual knowledge from the dialog
manager, that is, the prediction of the next dialog act, could improve the models.



Acknowledgments 

This work has been supported by the Spanish Research Projects CICYT
TIC2000-0664-C02-01 and TIC98-0423-C06-02.

References 

1. A. Bonafonte, P. Aibar, N. Castell, E. Lleida, J.B. Mariño, E. Sanchis, and M.I. Torres.
Desarrollo de un sistema de dialogo oral en dominios restringidos. In Proceedings of I Jornadas
en Tecnologia del Habla, 2000.

2. T. Brants. Cascaded Markov Models. In Proceedings of the EACL99, Bergen, Norway, 1999.

3. P. Clarkson and R. Rosenfeld. Statistical Language Modeling using the CMU-Cambridge
Toolkit. In Proceedings of Eurospeech, Rhodes, Greece, 1997.

4. S. M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component
of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35,
1987.

5. L. Lamel, S. Rosset, J. Gauvain, S. Bennacef, M. Garnier-Rizet, and B. Prouts. The LIMSI
ARISE System. Speech Communication, 31(4):339-353, 2000.

6. E. Levin and R. Pieraccini. Concept-Based Spontaneous Speech Understanding System. In
Proceedings of EUROSPEECH’95, pages 555-558, 1995.

7. W. Minker. Stochastically-Based Semantic Analysis for ARISE - Automatic Railway
Information Systems for Europe. 2(2):127-147, 1999.

8. F. Pla, A. Molina, and N. Prieto. Tagging and Chunking with Bigrams. In Proceedings of
the COLING-2000, Saarbrücken, Germany, August 2000.

9. F. Pla, A. Molina, and N. Prieto. An Integrated Statistical Model for Tagging and Chunking
Unrestricted Text. In Proceedings of the Text, Speech and Dialogue 2000, Brno, Czech
Republic, September 2000.

10. F. Pla, A. Molina, and N. Prieto. Improving Chunking by means of Lexical-Contextual
Information in Statistical Language Models. In Proceedings of 4th CoNLL-2000 and
LLL-2000, Lisbon, Portugal, September 2000.

11. E. Sanchis, E. Segarra, M. Galiano, E. Garcia, and L. Hurtado. Modelizacion de la Compresion
mediante Tecnicas de Aprendizaje Automatico. In Proceedings of I Jornadas en Tecnologia
del Habla, 2000.

12. R. Schwartz, S. Miller, D. Stallard, and J. Makhoul. Language Understanding using hidden
understanding models. In ICSLP, pages 997-1000, 1996.

13. E. Segarra, V. Arranz, N. Castell, I. Galiano, E. Garcia, A. Molina, and E. Sanchis.
Representacion Semantica de la Tarea. In Internal Report UPV DSIC-II/5/00, 2000.

14. E. Segarra, E. Sanchis, M. Galiano, F. Garcia, and L. Hurtado. Extracting Semantic
Information through Automatic Learning Techniques. In IX Spanish Symposium on Pattern
Recognition and Image Analysis-SNRFAI’01, 2001.




Shallow Processing and Cautious Incrementality 
in a Dialogue System Front End: 

Two Steps towards Robustness and Reactivity 



Torbjörn Lager

Uppsala University, Department of Linguistics
Torbjorn.Lager@ling.uu.se



Abstract. This paper presents the design and implementation of a simple and
robust dialogue system front end which performs rule-driven, incremental
processing of user contributions. We describe how a particular instantiation of the
front end can be made to perform a variety of tasks such as part of speech
disambiguation, word sense disambiguation, noun phrase detection, and dialogue
act recognition, always in an incremental manner. Still, incrementality is never
allowed to compromise the accuracy of the disambiguation decisions taken by
the system. Incrementality is cautious: when correctness is at stake, decisions are
delayed until more information becomes available. Furthermore, the format of the
necessary rules is very simple and uniform across different tasks, and rules can be
learned automatically from tagged dialogue corpora, using transformation-based
learning.



1 Introduction 

In human-human dialogue, the speaker’s contributions are typically fragmented and 
repetitions and self-repairs are common. This does not seem to cause problems for the 
hearer. Most dialogue systems, however, have difficulties with this kind of input, and 
very likely an utterance such as “please rephrase” is to be expected in response to it. 
Such dialogue systems lack what is commonly referred to as robustness. 

Also in human-human dialogue, the speaker’s contributions are typically accompa- 
nied by the listener’s feedback: reassuring positive feedback if the listener perceives, 
understands and accepts the speaker’s utterances, negative feedback to alert the speaker 
to possible misunderstanding or disbelief. However, when speaking (or writing) to a typ- 
ical dialogue system, the user may produce a very long utterance, hand the turn over to 
the system (e.g. by pressing the return key), and only then - far too late - meet with 
a negative response pertaining to information being given early in the utterance. Such 
dialogue systems show a lack of what is often referred to as reactivity. 

Very likely, a system lacking robustness relies on a module - a deep parser - which
tries to impose a phrase structure tree, and perhaps also a logical form, on an input
assumed to be a sentence. Since such a parser often rejects correct low-level parses
because they do not fit into a global parse, it often fails to provide any analysis at all for
‘ill-formed’ input. A way to cure brittleness suggests itself: replace (or complement) the
deep parser with a shallow parser which, on the basis of the output of a part of speech
tagger, performs chunking, and which will always produce an analysis, for any kind of
input, regardless of how noisy and ungrammatical this input is. In addition, shallow
processing may target not only syntactic analysis. Indeed, in the present paper, we will
be concerned with shallow processing techniques for limited semantic and pragmatic
processing as well. Finally, shallow processing techniques often come with learning
methods by means of which processors can be automatically trained or retrained on
corpus texts. They are therefore in general straightforward to port from one domain and
sublanguage to another. In contrast, configuring (say) a deep parser for a new domain
or sublanguage often involves a fair bit of manual writing of grammars and lexica.

This paper presents the design and implementation of a simple dialogue system 
front end which performs rule-driven, incremental processing of user contributions and 
produces rich, albeit shallow, representations of their form, content and function. We 
describe how a particular instantiation of the front end can be made to perform a vari- 
ety of tasks such as part of speech disambiguation, noun phrase detection, word sense 
disambiguation, and dialogue act recognition, always in the incremental manner neces- 
sary for truly reactive systems. However, incrementality is not allowed to compromise 
accuracy of the disambiguation decisions taken by the front end. When correctness is 
at stake, decisions are delayed, until more information becomes available. Furthermore, 
the format of the necessary rules is very simple and uniform across different tasks, and - 
as a way to boost portability - rules can be learned automatically from tagged dialogue 
corpora. 



2 Transformation-Based Tagging and Learning 

Transformation-based tagging (Brill [1]) has been part of the computational linguist’s 
standard repertoire for a number of years now. It is a simple rule-based framework, which 
comes with a learning method called Transformation-Based Learning (TBL). Among 
its strengths we find: Transformation-based taggers perform well (in terms of tagging 
accuracy). Transformation-based learning does not need much training data in order 
to learn rules that are reasonable. Rules can be easily understood, and (therefore) they 
can be edited manually, should we want to. Finally, transformation-based taggers are 
compact, and can be implemented very efficiently. 

Since Eric Brill first introduced TBL it has grown very popular. An ongoing attempt 
to compile a bibliography on TBL-related papers (see Lager [5]) lists 59 papers, involv- 
ing 38 different authors, dealing with natural language processing tasks as diverse as part 
of speech tagging, unknown word guessing, grapheme to phoneme conversion, morpho- 
logical analysis, word sense disambiguation, spelling correction, text-chunking/parsing, 
prepositional phrase attachment disambiguation, dialogue act tagging, and ellipsis res- 
olution. 

The μ-TBL system - described in detail in (Lager [2]) and available from
http://www.ling.gu.se/~lager/mutbl.html - implements a generalized form of
transformation-based learning. Through its support of a compositional
rule/template formalism and ‘pluggable’ algorithms, the μ-TBL system can easily be
tailored to different learning tasks. In particular, it has been used to learn rules for the
various analysis tasks described in the present paper: part of speech disambiguation,
noun phrase detection, word sense disambiguation, and dialog act recognition.



Part of Speech Disambiguation. This is the original TBL application area. A lexical 
lookup module assigns exactly one tag to each occurrence of a word (usually the most 
frequent tag for that word type), disregarding context. A rule application module then 
proceeds to replace some of the tags with other tags, on the basis of what appears in the 
local context. For example, in the µ-TBL system’s rule formalism, the rule

pos:'NN'>'VB' <- pos:'TO'@[-1]

means “replace tag NN with tag VB if the previous (-1) word is tagged TO”. Conditions
may refer to different features in the input, conditions may look backward or forward, 
and complex conditions may be composed from simpler ones. Two or more rules may 
be connected into sequences - or composed - by means of the composition operator o, 
where R o Rs basically means that the output of applying the rule R forms the input 
to the application of the rules Rs. Here is an example, showing the top four rules in 
a longer sequence: 



pos:'NN'>'VB' <- pos:'TO'@[-1] o
pos:'VB'>'NN' <- pos:'DT'@[-1,-2] o
pos:'IN'>'RB' <- wrd:as@[0] & wrd:as@[2] o
pos:'IN'>'WDT' <- pos:'VB'@[1,2] o

Typically, a part of speech tagger for English consists of a couple of hundred rules 
and achieves an accuracy of 96-98% when evaluated on a test corpus similar to the 
corpus it was trained on. 
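
The rule semantics described above can be sketched in a few lines of Python. This is a minimal illustration, not the µ-TBL implementation: the class names `Cond` and `Rule`, the helper functions and the toy lexicon passed to `tag` are all hypothetical. It also assumes that the conditions of a rule are checked against the input of that rule (changes made by the same rule at neighbouring positions are not visible), and that several offsets in one condition are alternatives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cond:
    feature: str    # e.g. "pos" or "wrd"
    value: str      # value the feature must have
    offsets: tuple  # relative positions; any one of them may match

@dataclass(frozen=True)
class Rule:
    feature: str    # feature whose value is rewritten
    old: str
    new: str
    conds: tuple    # all conditions must hold

def holds(cond, tokens, i):
    """True if some offset points at an existing token with the required value."""
    return any(
        0 <= i + d < len(tokens) and tokens[i + d][cond.feature] == cond.value
        for d in cond.offsets
    )

def apply_rule(rule, tokens):
    """Apply one rule at every position; conditions see only the rule's input."""
    out = [dict(t) for t in tokens]
    for i, tok in enumerate(tokens):
        if tok[rule.feature] == rule.old and all(holds(c, tokens, i) for c in rule.conds):
            out[i][rule.feature] = rule.new
    return out

def apply_sequence(rules, tokens):
    """R o Rs: the output of each rule is the input of the next one."""
    for rule in rules:
        tokens = apply_rule(rule, tokens)
    return tokens

# The first two rules of the sequence quoted in the text.
RULES = [
    Rule("pos", "NN", "VB", (Cond("pos", "TO", (-1,)),)),
    Rule("pos", "VB", "NN", (Cond("pos", "DT", (-1, -2)),)),
]

def tag(words, lexicon):
    """Lexical lookup (one tag per word, no context) followed by the rules."""
    tokens = [{"wrd": w, "pos": lexicon[w]} for w in words]
    return [t["pos"] for t in apply_sequence(RULES, tokens)]
```

With a toy lexicon that assigns NN to “walk”, the sequence “to walk the dog” is first tagged TO NN DT NN by lookup, and the first rule then retags “walk” as VB.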



Noun Phrase Detection. Based on output from a part of speech tagger, Ramshaw and
Marcus ([7]) developed a noun phrase chunker. The idea is to view chunking as a tagging
problem, and to encode the chunk structure as tags attached to each word. Ramshaw
and Marcus used three tags - I, O and B - to indicate if a word occurrence is inside an
NP, outside an NP, or on the border between two NPs, respectively. Inspired by their
work, Lager ([4]) used the µ-TBL system to train an NP-chunker on 150,000 words of
the WSJ corpus, which produced 100 rules, the top five of which are shown here:



np:'I'>'O' <- np:'O'@[1] & pos:'JJ'@[0] o
np:'I'>'B' <- np:'I'@[-2] & np:'I'@[-1] & pos:'DT'@[0] o
np:'O'>'I' <- np:'O'@[-2] & np:'I'@[-1] & pos:'DT'@[-1] o
np:'O'>'I' <- np:'I'@[-1] & pos:'CC'@[0] & pos:'NN'@[1] o
np:'O'>'I' <- np:'I'@[1] & np:'I'@[2] & wrd:about@[0] o


Evaluations of this kind of chunker show that the accuracy can be expected to land 
in the range of a respectable 90-94%. 
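
To make the tag encoding concrete, the following sketch turns a sequence of I/O/B tags back into NP spans. It is an illustration only: the function name `np_spans` is hypothetical, and it assumes the usual reading of the Ramshaw and Marcus encoding, in which B marks the first word of an NP that directly follows another NP.

```python
def np_spans(tags):
    """Return (start, end) pairs, end exclusive, for each NP encoded by I/O/B tags."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "O":                      # outside any NP: close an open chunk
            if start is not None:
                spans.append((start, i))
                start = None
        elif t == "B":                    # border: close the open chunk, start a new one
            if start is not None:
                spans.append((start, i))
            start = i
        elif t == "I":                    # inside: open a chunk if none is open
            if start is None:
                start = i
        else:
            raise ValueError(f"unknown chunk tag: {t!r}")
    if start is not None:                 # chunk running up to the end of the input
        spans.append((start, len(tags)))
    return spans
```

For the tag sequence O I O I I O, for instance, the function returns the two spans (1, 2) and (3, 5).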



Word Sense Disambiguation. In (Lager [4]) the µ-TBL system was used for learning
rules for word sense disambiguation. In a small experiment, the task was to assign one
out of six sense markers to each occurrence of “interest” or “interests” (markers meaning
1=“readiness to give attention”, 4=“advantage, advancement, or favour”, 5=“a company
share”, 6=“money paid for the use of money”, etc.). The point of departure was the idea
that the sense of an occurrence of a word can often be determined from just looking at
the two previous words and the two following words. Training resulted in a sequence of
273 rules, some of which are shown below:

sense:6>1 <- wrd:in@[1] o
sense:1>5 <- wrd:'%'@[-1,-2] o
sense:6>5 <- wrd:short@[-1] o
sense:6>4 <- wrd:best@[-1,-2] o
sense:5>4 <- wrd:of@[1] o

The resulting sense disambiguator reached an accuracy of 88%, which is on par with 
other word sense disambiguation methods. 



Dialogue Act Recognition. Inspired by the work reported in (Samuel et al. [8]), Lager
and Zinovjeva ([3]) used the µ-TBL system to learn dialogue act tagging rules from
a subset of the Maptask corpus of instructional two-person dialogues.

The idea here is that conditions for changing the tag of an utterance are sensitive to 
the actual words and word combinations used in the utterance, the length of the utterance, 
the previous dialogue act(s), the speaker’s role in the dialogue, and whether the speaker 
has changed since the previous utterance. 

The learning system found many rules where ‘cue-words’ indicate dialogue acts of 
various kinds: 

dact:ack>reply_n <- u_mem:'No'@[0]
dact:ack>reply_y <- u_first:'Uh-huh'@[0]
dact:ack>check <- u_first:'So'@[0]

It is interesting - and also typical of how transformation rules work - that the second 
of those rules - which changes an acknowledge tag into a yes-reply tag in the presence 
of the word ‘Uh-huh’ - is later followed by a rule which reverses that change again if 
the utterance is preceded by an instruct act: 

dact:reply_y>ack <- dact:instruct@[-1] & u_mem:'Uh-huh'@[0]

Other rules capture well-known regularities, e.g. the tendency that questions are 
often followed by replies 

dact:explain>reply_w <- s_change:yes@[0] & dact:query_w@[-1]

or that replies are usually not followed by other replies:

dact:reply_n>ack <- dact:reply_n@[-1]

Certain bigrams signal other dialogue acts:

dact:explain>query_yn <- u_bigram:(do,you)@[0]

The learning process resulted in a sequence of 348 rules, by means of which the 
test corpus could be tagged with an accuracy of 62.1%. This result is not as good as 
the 75.1% result reported by Samuel et al., but this can probably be explained by the 
particular characteristics of the Maptask corpus (long and varied dialogues), and the 
comparatively short time invested in the task. 
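
The kind of utterance-level conditions used by these rules can be sketched as follows. The feature names (`u_first`, `u_mem`, `u_bigram`, `s_change`) are taken from the rules quoted above, but their exact definitions are an assumption here: `u_first` is read as the first word of the utterance, `u_mem` as membership of a word in the utterance, and `u_bigram` as the set of adjacent word pairs. The functions `features` and `tag_utterance` are hypothetical, and for brevity the previous dialogue act is passed in as a fixed value rather than being retagged in step.

```python
def features(words, speaker, prev_speaker):
    """Cue features of one utterance (feature names follow the rules in the text)."""
    return {
        "u_first": words[0] if words else None,
        "u_mem": set(words),
        "u_bigram": set(zip(words, words[1:])),
        "s_change": "yes" if speaker != prev_speaker else "no",
    }

# (old tag, new tag, condition on the features and the previous dialogue act)
RULES = [
    ("ack", "reply_y", lambda f, prev: f["u_first"] == "Uh-huh"),
    ("reply_y", "ack", lambda f, prev: prev == "instruct" and "Uh-huh" in f["u_mem"]),
    ("explain", "query_yn", lambda f, prev: ("do", "you") in f["u_bigram"]),
]

def tag_utterance(default, feats, prev_dact):
    """Start from a default tag and let each rule in turn rewrite it."""
    tag = default
    for old, new, cond in RULES:
        if tag == old and cond(feats, prev_dact):
            tag = new
    return tag
```

An “Uh-huh” that follows an instruct act is first changed from ack to reply_y and then changed back to ack by the later rule, which reproduces the interplay of rules described above.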


3 Cautious Incrementality

3.1 Incrementality and Accuracy 

The present study is largely motivated by a wish to see how well simple systems of 
transformation rules, such as the ones presented in the previous section, would work in 
a dialogue system setting. The underlying assumption is of course that dialogue systems 
- and not just text processing applications - can make good use of part of speech taggers, 
word sense disambiguators and the like, and that robustness is an important property 
also in this context. 

However, traditional transformation-based taggers were built for batched processing 
of large quantities of text, and their implementations did not need to support incremental 
processing. What must be developed is a way to perform analysis in an incremental 
fashion. 

So what is the proper unit of incremental processing? The unit of a single word 
is of course a good candidate. By word-by-word incrementality we mean the kind of 
incrementality which insists on immediately analyzing each word as it is produced by 
the user. However, it is important to note that it is not always possible to make sense 
of a single word in the light of only the past (preceding) context. Very often the future 
(succeeding) context is relevant as well. For example, consider the not-yet-complete user
utterance²

U: ... in the light |

and the task of determining the status of the word “light” (its part of speech, its 
sense etc.). The system should know, at this stage, that “light” is not a verb. The verb 
reading is ruled out by the preceding determiner. However, “light” may still be a noun 
or an adjective. The system may guess, of course, but then an error might result, and the 
error may trigger unwanted behaviour in the system. There is clearly a tradeoff between 
incrementality and accuracy involved here. 

In the present paper, we make the following design decision: we choose not to
compromise accuracy. We require that our processing algorithm be complete with respect
to the semantics of transformation rules, i.e. that incrementality in no way hurts the
performance of the original taggers. We call this strategy cautious incrementality, and
although cautious incrementality is (sometimes) less incremental than word-by-word
incrementality, we believe that it is sufficiently incremental to form a good basis for
reactiveness in dialogue systems.

3.2 The Algorithm 

Our strategy in the case of the above example is to suspend processing and wait for the 
next word to be produced. If this word is (say) “of” then “light” must be a noun, but if 
the next word is “mood” then “light” is an adjective. Note that when we say “suspend 
processing” we do not suggest that the entire dialogue system process should be put 
in a wait state. Concurrency is assumed, and therefore only the process ‘in charge of’ 
assigning the part of speech to this particular word token needs to be put on hold. 

² The vertical bar symbol indicates the location of an insertion point.

In more detail: To process a word token Wi in an utterance (or an utterance token Ui
in a dialogue), a new process Pi is created whose only mission is to assign the correct 
tag to Wi (or Ui), based on information about its context. This process will be alive until 
this mission is over, and will then die. The process will suspend (but will still be alive) 
if some relevant context of Wi (or Ui) does not yet exist (i.e. it has not been produced 
by any dialogue partner yet), and will wake up and continue once the relevant context 
comes into existence. 

For each task we want the system to perform (part of speech tagging, word sense
disambiguation etc.), we need a default assignment of tags (e.g. a lexicon) and a sequence
of rules R1, ..., Rn of the kind presented in the previous sections.



3.3 A Small Example 

In this small example, we will examine in detail how an incremental part of speech tagger 
consisting of only the ‘lexicon’ 

Lex: light='VB' of='IN' the='DT' ... 

and two rules (R1 and R2)

pos:'VB'>'NN' <- pos:'DT'@[-1] o
pos:'NN'>'JJ' <- pos:'NN'@[1]

operates on a sequence of words W1, ..., W3 = “the light of”, being written word by
word, to an input channel.

- User writes “the ” (i.e. W1 = “the”). As the space character is written, a new process
P1 is started. Lex assigns DT to W1, and P1 then dies. The state of the evolving
utterance can be represented as the/DT.

- User writes “light ” (W2). A new process P2 is started. Initially Lex assigns VB
to W2. The rule R1 is activated and W2 is (temporarily) assigned NN (since the
previous word is indeed a DT). Rule R2 is partly triggered, but since the condition
pos:'NN'@[1] refers to the future, P2 suspends there, waiting, as it were, for the future
to happen. At this stage, the state of the evolving utterance can be represented as
the/DT light/_, where the underscore means that the part of speech of “light” has
not yet been determined.

- User writes “of” (W3). A new process P3 is started (P2 is alive, but still suspended).
W3 is assigned IN and since no rule is applicable, P3 terminates with IN as the
final result. In the meantime P2 wakes up, determines that the token of “light”
(W2) is indeed a noun (since the condition pos:'NN'@[1] is not satisfied), and
then terminates. The state of the evolving utterance can be represented as the/DT
light/NN of/IN.

This is very simple, but it can indeed be generalized to all the other disambiguation tasks
described in this paper. Note for example that since a rule Ri always triggers on the
results of applying the preceding rules (up to R(i-1)) to the neighbouring words, the
dependencies between processes are never circular, and deadlock is effectively avoided.
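
The walk-through above can be reproduced with the following sketch. It is an assumption-laden illustration rather than the authors' algorithm: instead of one suspended process per word it recomputes tags on demand and reports a special `PENDING` value where a process would be suspended. The class `CautiousTagger`, its methods, the toy entry for “mood” in the lexicon, and the restriction to part-of-speech conditions are all choices made for this example.

```python
PENDING = object()  # marks a tag that cannot be decided yet

LEX = {"the": "DT", "light": "VB", "of": "IN", "mood": "NN"}  # toy lexicon
# (old tag, new tag, [(required tag, offsets), ...]); offsets are alternatives
RULES = [
    ("VB", "NN", [("DT", (-1,))]),  # R1
    ("NN", "JJ", [("NN", (1,))]),   # R2
]

class CautiousTagger:
    def __init__(self, lexicon, rules):
        self.lexicon, self.rules = lexicon, rules
        self.words, self.closed = [], False

    def feed(self, word):
        self.words.append(word)

    def close(self):
        """Signal that the utterance is complete: no further words will arrive."""
        self.closed = True

    def tag_after(self, k, i):
        """Tag of word i after rules R1..Rk, PENDING if it depends on unseen words."""
        if i < 0 or (i >= len(self.words) and self.closed):
            return None                  # position is known not to exist
        if i >= len(self.words):
            return PENDING               # word has not been produced yet
        if k == 0:
            return self.lexicon[self.words[i]]
        prev = self.tag_after(k - 1, i)
        old, new, conds = self.rules[k - 1]
        if prev is PENDING or prev != old:
            return prev                  # rule Rk cannot fire at this position
        undecided = False
        for value, offsets in conds:
            seen = [self.tag_after(k - 1, i + d) for d in offsets]
            if value in seen:
                continue                 # condition certainly holds
            if any(s is PENDING for s in seen):
                undecided = True         # condition may still come to hold
            else:
                return prev              # condition certainly fails
        return PENDING if undecided else new

    def state(self):
        n = len(self.rules)
        tags = (self.tag_after(n, i) for i in range(len(self.words)))
        return " ".join(f"{w}/{'_' if t is PENDING else t}" for w, t in zip(self.words, tags))
```

Feeding “the”, “light” and “of” one at a time yields the three states of the example: the/DT, then the/DT light/_, then the/DT light/NN of/IN. The recursion is not memoized, so this sketch is only suitable for short inputs and short rule sequences.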


4 A Prototype Implementation

A prototype implementing the design proposed in the present paper has been written in 
the Oz programming language (Mozart [6]). This was very straightforward, since Oz 
supports concurrent processes in the form of threads, as well as dataflow variables to 
synchronize them. This gives us the whole delaying mechanism - and thus incrementality 
- practically for free. 
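
The role of dataflow variables can be imitated in Python with `asyncio` futures, which also block a reader until a value has been written. The sketch below is not a port of the Oz prototype: the names `Cells`, `process` and `run_demo` are hypothetical, the two rules of the small example are hard-coded, and the driver simply yields to the event loop a few times after each word so that woken processes can finish.

```python
import asyncio

LEX = {"the": "DT", "light": "VB", "of": "IN", "mood": "NN"}  # toy lexicon

class Cells:
    """Write-once cells keyed by (rule stage, word position), created on first use."""
    def __init__(self):
        self._cells = {}

    def get(self, stage, pos):
        key = (stage, pos)
        if key not in self._cells:
            self._cells[key] = asyncio.get_running_loop().create_future()
        return self._cells[key]

async def process(i, word, cells, final):
    """Process Pi: assign the final tag of word i, suspending on missing context."""
    tag = LEX[word]
    cells.get(0, i).set_result(tag)
    # R1: pos:'VB'>'NN' <- pos:'DT'@[-1]
    if tag == "VB" and i > 0 and await cells.get(0, i - 1) == "DT":
        tag = "NN"
    cells.get(1, i).set_result(tag)
    # R2: pos:'NN'>'JJ' <- pos:'NN'@[1]  (suspends until word i+1 has passed R1)
    if tag == "NN" and await cells.get(1, i + 1) == "NN":
        tag = "JJ"
    final[i] = tag

async def run_demo(words):
    """Feed the words one by one and record the visible state after each word."""
    cells, final, tasks, snapshots = Cells(), {}, [], []
    for i, word in enumerate(words):
        tasks.append(asyncio.create_task(process(i, word, cells, final)))
        for _ in range(3):               # let started and woken processes run
            await asyncio.sleep(0)
        snapshots.append(
            " ".join(f"{w}/{final.get(j, '_')}" for j, w in enumerate(words[: i + 1]))
        )
    for task in tasks:                   # processes still waiting for a next word
        task.cancel()
    return snapshots
```

Calling `asyncio.run(run_demo(["the", "light", "of"]))` returns the three successive states of the small example in Section 3.3, with “light” left undetermined until “of” has arrived.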

The state of the evolving dialogue is represented by means of an open record, i.e. 
a record which is ‘side-ways open’ in the sense that new features can always be added 
to it. The features are integers representing the positions of utterances in the dialogue, 
and the positions of words in utterances. Such records - as well as constraints over such 
records - are available as language primitives in Oz. 

The following record represents the state just after the system and the user have written

S: do you need a loan?

U: yes to low interest as well |

but before the closing of the user’s utterance:

d(1:u(speaker:system
      dact:query_yn
      wrd:w(1:do 2:you 3:need 4:a 5:loan 6:'?')
      pos:p(1:'VBP' 2:'PRP' 3:'VB' 4:'DT' 5:'NN' 6:'.')
      np:n(1:'O' 2:'I' 3:'O' 4:'I' 5:'I' 6:'O')
      sense:s(1:'??' 2:'??' 3:'??' 4:'??' 5:'??' 6:'??'))
  2:u(speaker:user
      dact:reply_y
      wrd:w(1:yes 2:to 3:low 4:interest 5:as 6:well ...)
      pos:p(1:'UH' 2:'TO' 3:'JJ' 4:'NN' 5:_ 6:'RB' ...)
      np:n(1:'O' 2:'O' 3:'I' 4:'I' 5:_ 6:_ ...)
      sense:s(1:'??' 2:'??' 3:'??' 4:6 5:'??' 6:'??' ...)
      ...)
  ...)

In accordance with the rule 

dact:explain>query_yn <- u_bigram:(do,you)@[0]

the dialogue act of the first utterance was determined exactly at the point when the
system had generated “do you”.

The dotted parts show that the second utterance, as well as the whole dialogue, are 
still open. Note that the part of speech of “as” has not yet been determined, since the 
process working on it still needs to discharge the rule 

pos:'IN'>'RB' <- wrd:as@[0] & wrd:as@[2] o

That is, it still remains to be seen if the user will continue with another occurrence 
of “as”. If he does, the first occurrence of “as” will receive the tag RB, but if he does 
not, it will be tagged IN. Note that the token of “interest” has received the sense tag ‘6’, 
which suggests it means “money paid for the use of money”. 
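
As a rough analogue of such an open record, the sketch below uses nested Python dictionaries keyed by integer positions, with `None` standing in for a value that has not yet been determined. This is only an approximation under stated assumptions: Python dictionaries are not Oz open records, `None` does not block readers the way an unbound dataflow variable does, and the helper names `new_utterance`, `add_word` and `render` are hypothetical.

```python
def new_utterance(speaker):
    """An utterance that can still grow: layers are keyed by word position."""
    return {"speaker": speaker, "dact": None, "wrd": {}, "pos": {}, "np": {}, "sense": {}}

def add_word(utt, word):
    """Append a word; all of its analysis layers start out undetermined."""
    i = len(utt["wrd"]) + 1
    utt["wrd"][i] = word
    for layer in ("pos", "np", "sense"):
        utt[layer][i] = None
    return i

def render(utt, layer):
    """Show one layer, printing '_' for positions that are not yet determined."""
    return " ".join(
        f"{i}:{'_' if v is None else v}" for i, v in sorted(utt[layer].items())
    )

# The dialogue itself is open in the same way: utterances are added by position.
dialogue = {}
dialogue[2] = new_utterance("user")
for w in ["yes", "to", "low", "interest", "as", "well"]:
    add_word(dialogue[2], w)
for i, t in {1: "UH", 2: "TO", 3: "JJ", 4: "NN", 6: "RB"}.items():
    dialogue[2]["pos"][i] = t
```

Here `render(dialogue[2], "pos")` shows position 5 (“as”) as undetermined, mirroring the record above.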


5 Conclusions and Further Work

We have shown that, by means of a simple form of synchronized concurrency, rule- 
based processing of dialogue input, in a shallow and (almost) fully incremental fashion, 
is possible. We have shown that the resulting representations can be fairly rich and 
accurate, and we have argued that the use of transformation-based learning enhances the 
portability of a dialogue system front end built on these ideas. 

In the future, we would like to learn rules from dialogue corpora collected from 
(simulations of) a particular task in a particular domain, and to connect the proposed 
front end to a back end (dialogue manager etc.), in order to build a full-blown dialogue 
system for that task and domain. 


Abstract. The development of computerized information retrieval dialogue systems
communicating with the user in natural language requires an effective training
procedure with which the main modules of the dialogue system can be developed
partly automatically. This paper describes an attempt to create sentence templates
automatically, using a program package that implements a specially developed method
for the quantitative linguistic analysis of transcribed real dialogues. First, the program
package generates a set of formulas (templates) consisting of elements of a special
grammar and describing the syntactic structure of the required sentences. Second, it
generates a large corpus of unique training sentences using the sentence templates and
a stochastic context-free grammar. The experimentally created corpus was used
for the training of modules of a city information dialogue system.



1 Introduction 

Modern computerized information retrieval dialogue systems communicating with the 
user in some form of natural language seem to be an effective and comfortable tool to 
obtain exact information on bus, train, or plane departures/arrivals, on the products on 
offer in hypermarkets and department stores, theater or cinema performances, interest- 
ing landmarks and many others. Such dialogue systems are at present usually developed 
by training their components (recognizer, linguistic analyzer, dialogue manager) using 
several training corpora that have to be created manually or with the aid of some
artificially designed supporting means. For the creation of a corpus of training sentences
we normally use some kind of sentence generator, but the set of generating templates
describing the sentence structures, with the corresponding probabilities of occurrence of
their elements, has so far had to be created entirely manually, on the basis of a rigorous
statistical and linguistic analysis of possible sentence structures.

We would therefore like to present some proposals that could improve the automatic
creation of training corpora and make the corpora more natural and acceptable.
For this purpose we used linguistic methods of interaction (conversation) analysis and an
extensive corpus of authentic spoken data. Since the digital recording of real dialogues
in information or shopping centres does not cause difficulties, we tried to utilize recorded
real dialogues from Czech tourist information centres for the automatic generation of
generating templates. We decided to restrict ourselves to the task “recognition of the first
turns in information demands”. Based on a smaller part of the corpus, a set of typical
syntactic constructions occurring in Czech dialogue beginnings was compiled. Their
elements and combination rules served to create a number of turn templates. With a special
program package, a training corpus of user demands was then created automatically.

2 Theoretical Basis of Linguistic Analysis 

Theoretical, systemic-functional and functional-generative linguistics have contributed
important knowledge on the parsing and disambiguation of sentences and on linguistic
structures and functions. But since we cannot consider single sentences or written texts
to be part of spoken conversation, we have to analyse naturally occurring talk-in-interaction.
That means we have to analyse how turn taking and co-operation between
speakers function, how speakers manage misunderstandings, which linguistic elements are
used to structure and conduct the course of the dialogue, which elements show the
communication partner that there is something wrong, and which elements can prompt
specific ways of interpreting the communicative functions (illocutionary forces) of
utterances.

Therefore we want to refer to the achievements of interactional linguistics and con- 
versation analysis - two linguistic disciplines, which deal with the detailed analysis of 
authentic everyday talk and talk in various institutional settings. Especially important 
are the linguistic findings on dialogue and turn construction, because they can help us 
to understand how dialogue systems have to be created in order to answer the needs of 
the user. 

Authentic, naturally occurring utterances (and consequently also demands of clients
in tourist information centres) need not be complete and/or correct in the sense of
grammar and linguistics. These elliptical and other “incorrect” constructions are
nevertheless fully functional, even if they occur at the beginning of the dialogue:

SB_190899_176 “wc” (1)

01 KF: good afternoon

02 IF: good afternoon, please

03 KF: I just wanted to ask bathrooms toilets

The construction of authentic talk-in-interaction follows its own rules, which we have
to describe with respect to our specific corpus and their applicability to the creation of
automated dialogue systems for the Czech language. It is therefore not useful to work only
with terms of systemic linguistics or other linguistic theories relating to the “correctness” or
grammaticality of linguistic means and their use. Besides the term sentence, meaning a certain
pattern (paradigm) on the systemic level of language, we should use the terms utterance
(realization of a sentence), turn (part of the dialogue produced by one speaker without
interruption), sequence (at least two turns referring to each other and building a more
complex part of the dialogue, e.g. question-answer), turn construction unit (TCU) (“the
smallest linguistically possible unit..., with one or more than one TCU constituting
a possible turn...” [5], p. 14), and turn fragment (unfinished stretches of talk [4], p. 3).

The communicative function of a speaker’s utterance in face-to-face talk need not
correspond to its context-free semantic interpretation. It is the following turn of the
other speaker that makes it possible to interpret the communicative function of a turn
correctly. For instance, questions on the location of certain objects (sights) in town are
usually understood to be demands for direction giving, and utterances asking about
the wish of the client are interpreted as an invitation to pose a question or demand.
Furthermore, there are some particles and dialogue markers in the Czech language which
are potentially polyfunctional [2]. Only their position in the dialogue, their weight (the
capability to stand alone, i.e. to represent a whole turn or TCU) and their prosodic
characteristics (intonation, pauses, pitch ...) determine the different communicative
functions of these linguistic elements:

WB_120899_001 “railway station” (2)

012 KF: no
(feedback signal, shows attention, stands alone)

088 KM: no
(positive answer, stands alone)

089 IF: no (...) to nadrazi je ...
(filler word without stand-alone or explicit meaning)



3 Methodical Approach 

First of all, it is necessary to have a large amount of authentic material from the domain of
interest. Our corpus contains about 500 dialogues with an overall length of more than
13 hours. This material has to be transcribed and analysed in order to determine the
course of information dialogues and to assess which dialogues and turns are typical. In
the first phase of our project, we concentrated on initial phrases containing a concrete
information demand of a client.

The next step is to analyse their construction principles. As we stated above, each
turn can consist of one or several turn construction units (TCUs). One important criterion
of TCUs is their potential completeness: they can appear as syntactically complete
sentences, noun phrases (NP), prepositional phrases (PP) or even one-word constructions.
Frequent TCUs are for instance greetings and contact formulas. In spoken dialogues, it
is crucial to take into consideration all potentially constitutive aspects of a TCU: syntactic
construction, semantic structure, pragmatic meaning (communicative function) and
prosodic characteristics, because the findings of conversation analysis in other languages
lead us to expect an interplay of all these aspects. These aspects do not necessarily
correspond - there are cases in which only one aspect fulfils a distinctive function.

Corresponding to the varied, but not numerous, kinds of services offered by
tourist information centres (information on local and regional sights, institutions and
so on; direction giving; sale of post cards, stamps, maps and souvenirs ...) there is
a certain spectrum of syntactic turn constructions with similar characteristics - they
differ, however, with respect to explicitness, word order, or the absence or presence of
further elements like politeness formulas. That means that many of the templates created
by the program (the templates are described in the following sections)
have a number of elements in common. The templates consist of a finite number and
a certain kind of combination of non-terminal symbols (some of them are TCUs), but
their concrete linguistic realization (see the terminal symbols) is quite varied.

For our purpose it is nevertheless necessary to go beyond the boundary of TCUs,
because most of the demands are formulated using the same or very similar syntactic
constructions. In order to distinguish them and to create derivation rules valid for
the templates of the whole corpus, we have to define the individual components of some
syntactic constructions, their role and context status. The structuring of the turns into
combinable components is partly based on valency theory (that is, each valency
position of the predicate is represented by one non-terminal element); partly
it goes even deeper (numerals and determiners are treated as single components since
they regularly occur at the beginning of certain types of non-terminal elements). Some
examples for the structuring of user demands are:

NB_010999_002 “wish” (3)

@INTENT          I just wanted to ask uh
@EXIST           is
@DEIX\_HERE      here
@INTOWN          in jaromerice
@INST\_NOM\_S    a postal investment bank



The relationship between the single non-terminal symbols (realized by smaller linguistic
units like noun phrase (NP), verbal phrase (VP), prepositional phrase (PP), numeral
(Num), pronoun (Pron) and so on) and the dialogue unit “turn” is relevant insofar as
there are turns with only one symbol and turns with several. But only a few non-terminal
symbols are really capable of serving as a stand-alone turn. In most cases we have
to deal with several non-terminal symbols, which can moreover be combined in several
ways. They have different syntactic weight, i.e. they can be predicative units providing
valency positions which have to be occupied, or non-predicative units serving just to
occupy free valency positions (NPs, PPs). The latter can also appear alone, and their word
order is quite variable.

On the basis of our test corpus the syntactic and dialogic status of non-terminal 
symbols can be defined as follows (examples): 



@GREET       NP, TCU
@INTENT      VP = incomplete, S = TCU, pre-position
@ITEM        NP
@PLACE       PP
@QUANT       Num
@EXIST       VP, incomplete
@DET         Pron
@INDET
@POLITE      Particle, pre-, central, or post-position independent, TCU
@INST        NP
@POSSIBLE    VP, incomplete, post-position



There are some elements which are capable of taking a pre- (first) position and necessarily
require a completion, according to the regularities in our authentic corpus, i.e. in spoken
language. These positionally dependent elements are very important for the automatic
generation of templates, because they are regularly followed by another element, which
can thus be predicted. This fact has to be included in the program and can help to
recognize at least the status of some unclear words or passages. Examples of positionally
dependent elements (symbols) in user demands at the beginning of the dialogue (this
need not hold for any other dialogue position) are @DET, @INDET or @QUANT:

[@POLITE]50 [@QUANT]50 @DET\_ACC\_P @ITEM\_ACC\_P
@INTENT @INDET\_ACC\_SF @ITEM\_ACC\_SF
@INTENT @DET\_NOM\_SN @INST\_NOM\_SN

Most of the expressions representing positionally dependent non-terminal symbols are
optional; the inflationary and redundant use of determiners in spoken Czech in particular
is a well-known fact. Elements which are incomplete with respect to valency
(@EXIST, @POSSIBLE or @ACCESS) behave analogously: they are regularly followed
or preceded by @ITEM, @INST or @ACTION, realized by a finite number of
possible elements (terminal symbols representing more or less complex NPs with or
without determiners, or infinitives):

ITEM    mapu zpravodaje program
INST    z\'{a}mek rozhledna toalety l\'{e}k\'{a}rna hudebniny
        baz\'{e}n ''vyhl\'{i}dkov\'{a} v\v{e}\v{z}" ''muzeum
        um\v{e}n\'{\i}" banka galerie informace
ACTION  ubytovat ''jezdit na koni" ''vym\v{e}nit pen\'{i}ze"

These elements can also - as already mentioned - occur without their syntactically bound
part, the predicate. In these cases they are almost always preceded by a determiner and/or
an utterance standing for the non-terminal symbol @INTENT:

@INTENT @INDET\_NOM\_P @ITEM\_NOM\_P
@INTENT @DET\_NOM\_SF @INST\_NOM\_SF

At this stage of our project we have to take note of several problems concerning the
structuring of units which should get the status of non-terminal symbols. These problems
reflect the whole spectrum of different verbalizing modes and language use in
spoken interaction.



4 Creation of Sentence Corpora 

A balanced corpus of training sentences has to cover all kinds of sentence structures
and the various forms of sounds; the distribution of sounds has to correspond to their
distribution in speech. Compactness can be seen as an advantage of such a corpus; it typically
comprises 5,000 or 10,000 training sentences, which cannot reasonably be written manually
(this procedure is extremely tiring and inefficient). That means that the set of
training sentences must be generated with a special sentence generator which guarantees
the required statistical properties of the sentence structures. Moreover, the sentences are
composed of few sentence kinds (basic structures) containing a great number of different
words and phrases.

The stochastic sentence generator [3] is used to generate a training set of Czech
sentences. The input file containing the original sentence templates is in our case mostly
generated automatically on the basis of the linguistic analysis of real-life dialogues
described above (only a small part of the sentence templates has to be created manually
as a learning corpus for the generator). It has to be adapted to the universal generator
structure, i.e. all Czech orthographic symbols have to be represented in TeX notation,
e.g. special characters as

á = \'{a}, í = \'{\i}, ř = \v{r}, ů = \accent23u, _ = $\_$, etc.
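
The conversion into this notation can be sketched as below. The function name `to_tex` is hypothetical, and only the patterns visible in the text are covered: acute accents as \'{x} (with the dotless \i for í), carons as \v{x}, the ring on ů as \accent23u, and the underscore as $\_$. Other characters are passed through unchanged, which is an assumption about what the generator accepts.

```python
import unicodedata

ACUTE, CARON, RING = "\u0301", "\u030c", "\u030a"  # combining accent marks

def to_tex(text):
    """Rewrite accented Czech letters in the TeX notation listed in the text."""
    out = []
    for ch in text:
        if ch == "_":
            out.append(r"$\_$")
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2:
            base, mark = decomposed
            if mark == ACUTE:
                # an accented i uses the dotless \i
                out.append("\\'{" + ("\\i" if base == "i" else base) + "}")
                continue
            if mark == CARON:
                out.append("\\v{" + base + "}")
                continue
            if mark == RING and base == "u":
                out.append("\\accent23u")
                continue
        out.append(ch)  # plain characters are left unchanged
    return "".join(out)
```

For example, `to_tex("zámek")` gives `z\'{a}mek`, the form in which this word appears in the grammar rules above.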

The generator produces sentences with the aid of a context-free grammar. In addition to
the usual start symbol, this grammar involves a set of special “non-visible”
non-terminal symbols TEMPL_1 ... TEMPL_n corresponding to the respective question/answer
types. These symbols represent “empty” places that will only be expanded;
they are named “templates” in this paper. The sentence templates occur in the format

TEMPL_i  [@N_1]p_1 [@N_2]p_2 [@N_3]p_3 ...,

where the symbols N_i, i = 1, 2, ... represent the names of non-terminal symbols and the
numbers p_j, j = 1, 2, ... the occurrence probability of the associated non-terminal symbol;
production rules of the grammar take the form

N_1  [@N_2]p_2 | @N_3 | T_1 | T_2 ...

where the symbols on the right-hand side of the rule, separated by vertical lines, represent
alternatives of the rule, N_i ∈ V_N, T_j ∈ V_T; terminal symbols consisting of several
words must be enclosed in quotation marks, e.g. “muzeum umeni”. Only one
of the alternatives can be selected for the sentence generation (all alternatives occur
with the same probability). Alternatives in brackets supplemented by an integer number
p_k represent an optionally occurring part of the sentence (named by the non-terminal
symbol N_k) with the corresponding occurrence probability. The following Example 1
shows a small part of the created file of templates and grammar rules (some were already
cited above).

Example 1: A part of a template file

[@POLITE]50 [@QUANT]50 @DET_P @ITEM_P.

[@INTENT]33 @INTERROG_LOC @POSSIBLE @ACTION [@INTOWN]25?

[@GREET]25 @INTENT [@ITEM_NOM]50.

[@KNOW_I]50 @PLACE @ACTION_OTHER @CONJ_AND @INTENT @INTERROG_LOC.

[@GREET]50 [@HESITATION]50 @INTENT @INTERROG_IF @EXIST_S
[@INTOWN]50 @INST_NOM_S.

INTENT "já sem se šla jenom zeptat" | "mám takovej dotaz" | "chtěli sme se zeptat" |
"chtěl … zeptat" | "můžu se zeptat"

POLITE prosím | "prosím vás" | "prosím vás pěkně"

INTERROG_LOC kde | "s kerý strany" | kdepak ... etc.
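The expansion of a template can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation. It assumes that the integer after a bracketed symbol is a percentage and that every alternative of a rule is chosen uniformly, as described above. The miniature rule set, including the GREET and ITEM_NOM entries, is hypothetical.

```python
import random
import re

# Hypothetical miniature of a template file in the style of Example 1.
TEMPLATES = ["[@GREET]25 @INTENT [@ITEM_NOM]50 ."]
RULES = {
    "GREET": ["dobrý den"],                              # assumed rule
    "INTENT": ["mám takovej dotaz", "můžu se zeptat"],
    "ITEM_NOM": ["muzeum umění"],                        # assumed rule
}

# One token is an optional symbol "[@N]p", a required symbol "@N", or a literal word.
TOKEN = re.compile(r"\[@(\w+)\](\d+)|@(\w+)|(\S+)")

def expand(symbol, rules, rng):
    """Choose one alternative of the rule uniformly and expand it recursively."""
    return realize(rng.choice(rules[symbol]), rules, rng)

def realize(text, rules, rng):
    """Expand every symbol in a template or rule alternative into words."""
    words = []
    for optional, percent, required, literal in TOKEN.findall(text):
        if optional:
            # The optional part occurs with probability percent / 100.
            if rng.random() * 100 < int(percent):
                words.append(expand(optional, rules, rng))
        elif required:
            words.append(expand(required, rules, rng))
        else:
            words.append(literal)
    return " ".join(words)

def generate(templates, rules, rng):
    """Generate one sentence from a randomly chosen template."""
    sentence = realize(rng.choice(templates), rules, rng)
    return re.sub(r"\s+([.?!])", r"\1", sentence)  # attach punctuation
```

Calling `generate(TEMPLATES, RULES, random.Random(0))` repeatedly yields sentences in which the greeting appears in roughly a quarter of the cases and the item in roughly half.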



The following can be seen as advantages of the presented grammar definition:

- a great number of training sentences can be generated;

- the word dictionary is stored in one place and can later be modified more easily;

- different significance of word groups can be taken into account.

On the other hand, the meaning of the generated sentences cannot be supervised, nor can
the duplicate occurrence of some terminal symbols, e.g. local names; this is the
disadvantage of the concept. As a consequence, a small number of generated sentences may
be syntactically correct but meaningless. The probability of such sentences can be
reduced by a sufficiently large variety of templates.

The structure of the sentence templates in the generator's template set was developed on
the basis of a linguistic analysis of a corpus of transcribed real dialogues, recorded
in our case in 12 information offices and centres of the Czech Republic. The collected
real queries were generalized and “standardized” from the viewpoint of semantic
information, and 100 sentence templates were obtained on this basis in the first attempt.







J. Schwarz and V. Matousek 



5 Creation of Sentence Templates 

The automated creation of training sentence templates requires the manual creation of
only 1 template pattern, which is then used as “learning material” for the initialization
of the program. The implemented program package reads the template patterns, creates an
internal representation of the concrete grammar of the sentences generated by the prepared
templates, and in the last step analyzes the corpus of transcribed “real-life” dialogues.
Based on the “learned” fundamental (manually written) sentence templates, it creates a
template for each analyzed dialogue request. Where the linguistic analysis module cannot
classify some part of the input sentence, it asks the user to determine the “unknown”
part of the sentence and “learns” the user's decision (new input). From this point of
view the creation of sentence templates is not automatic, but only automated.
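The interplay between learned classifications and user decisions can be sketched as follows. This is a strongly simplified illustration in Python, not the described program package: it classifies single words rather than phrases, and the function name, the lexicon contents and the `ask_user` callback are assumptions made for the example.

```python
def make_template(words, lexicon, ask_user):
    """Build a template from a transcribed request.

    lexicon maps known words to non-terminal names. For an unknown word
    the user is asked once, and the answer is stored ("learned").
    """
    symbols = []
    for word in words:
        if word not in lexicon:
            lexicon[word] = ask_user(word)  # remember the user's decision
        symbol = lexicon[word]
        # Consecutive words of the same class form one template slot.
        if not symbols or symbols[-1] != symbol:
            symbols.append(symbol)
    return " ".join("@" + s for s in symbols)
```

Because the decision is stored in the lexicon, the same unknown word does not trigger a second question when it appears in a later request.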

The program package consists of seven program modules written in the C++ programming
language, in which the sentence structures are implemented as simple dynamic linked
lists. The grammar rules “learned” from the manually prepared template patterns are
represented by multi-level linked lists interconnected by a specially developed data
structure. The calls of the program modules are controlled by a simple Perl control
routine; the complete program package was tested on a powerful Sun SPARC workstation
under the Solaris 8 operating system.

6 Experiments and Results 

When creating the templates for sentence generation, we were confronted with several
linguistic problems.

First of all, it is sometimes difficult to decide which units belong together and which
could also be interpreted as separate units. This strongly depends on the word order
(which is quite variable in Czech) and on the relatively fixed position of enclitics.
This means that we have to deal with similar constructions that represent the same
non-terminal symbol but have different syntactic properties, for instance máte tady
(‘do you have here’, first position in a request) and tady máte (‘here have you’,
position after an interrogative).

The second, quite serious problem is the divergence between standard Czech and colloquial
Czech. In everyday spoken communication colloquial Czech overwhelmingly prevails, and it
has, of course, many regional specifics. Nevertheless, which of these codes is used
depends on a range of situation-determined aspects, including the relationship between
the communication partners. In our corpus we find a wide range of variations and
combinations of both codes (code-switching is an extremely widespread phenomenon in
everyday Czech communication). Even though the clients and the information specialists
do not know each other, they very often communicate in an informal, relaxed way, i.e.
using colloquial Czech. Nonetheless, some people speak exclusively standard Czech
(especially older clients). What we do not know is which code people will use to
communicate with an automatic dialogue system, which is more anonymous and represents
a fairly new kind of communication. At present we know little about the regularities of
Czech human-machine communication. That is why we have to take into account all problems
and variations of code-switching between standard and colloquial Czech.

A third, well-known and often cited problem is that we are dealing with a highly
inflected language. The compatibility of and agreement between terminal symbols is
strongly influenced by this fact. Furthermore, there are more possible combinations of
adjectives and other linguistic means because of certain unification processes in spoken
Czech (for instance, some endings are being reinforced while others are disappearing).

A corpus of 1,000 training sentences (possible first turns at information service
desks) was created experimentally using a set of automatically prepared sentence
templates. The probability distribution of sentence structures was tested; it corresponds
to our expectations. Since the generated turns revealed a lot of new information about
incorrectly modelled sentence templates, we repeated this procedure twice: the percentage
of correct, i.e. reasonable turns (without duplicates) increased from 68 % to 74 % and
finally to 77 %. In the next stage we will increase the number of sentence templates and
further improve the details of their structure, so that it will be possible to create a
really large and usable corpus. In the next few weeks the corpus will be used to train
the word recognizer module of a newly developed tourist information dialogue system.

7 Conclusion 

During the first phase we analyzed only the first requests of clients, i.e. their first
utterances after entering the tourist information office. About 60 % of our corpus of
real dialogues could be used for the automated generation of sentence templates, but we
expect a much higher percentage of usable dialogues for the generation of other sentence
templates (further questions and requests of clients, problem solving, misunderstandings
and so on), because recording problems and missing passages occurred mostly at the
beginning of dialogues. Nevertheless, our corpus of real dialogues in Czech tourist
information centres is large enough and represents various regions of the Czech Republic.

With regard to the authentic data and the results of sentence generation, we have to
state that there are constructions which must be judged ungrammatical from the viewpoint
of standard grammar, but which are fully functional and acceptable in spoken
communication. They do not (or only rarely) cause misunderstandings; rather, they are
regarded as economical linguistic means for solving problems quickly. The training
corpus has to cover all these cases.

Abstract. This paper deals with the adaptation of a typical dialogue manager structure
to a dialogue system for blind and partially sighted programmers within the framework of
the project “Dialog System for Support of Visually Impaired Programmers” *.
The main goal of this project is to develop a dialogue system that allows source code
to be generated, edited and debugged. The dialogue is based on natural speech
communication, supported by a speech recognition system and by speech synthesis that
takes the prosodic aspects of the Czech language into account to a large extent.



1 Introduction 

The technical focus of this article is the development of a dialogue system for visually
impaired programmers. We confine our attention to a system in which natural language
plays an important part in the communication process. We consider both totally blind and
partially sighted people, who can use a keyboard and a specially arranged visual output
on the display screen. We have developed a graphical interface for partially sighted
programmers based on large-font text output. However, dialogue in spoken natural language
is seen as the most convenient kind of man-machine communication, and for blind people as
the only meaningful one.

Therefore, this paper deals with the development of a spoken language dialogue system
that enables visually impaired persons to create, edit and debug programs. The system
communicates by combining classical keyboard access with a spoken language interface.
Many principles of the proposed system seem to be advantageous for sighted programmers
as well, making programming more comfortable for them.

The main components of a typical spoken language system are a speech recognizer, a
linguistic analyzer, a speech synthesizer and a dialogue manager with a message generator
module. The dialogue manager is usually designed to operate on semantic units whose
representation is independent of language and domain. We decided to adapt the typical
dialogue manager structure to a programming environment for visually impaired programmers.

2 Dialogue Simulation 

We have created and examined the initial part of a corpus of human-human dialogues
recorded while simulating the communication between a visually impaired programmer and
a dialogue system.

* Project No. 201/99/1248, funded by the Grant Agency of the Czech Republic

The unit of the corpus, called a ‘record’, is defined as follows. Two persons talk in a
quiet room equipped with a computer and a sensitive microphone. One person represents a
programmer and has no visual contact with the compiler environment on the monitor.
He/she has to solve a task using a programming language (e.g. to write a program which
computes circle volume). The second person, called the ‘recorder’, represents the dialogue
system and operates the compiler environment to carry out the programmer's commands. Thus
he/she writes the source code in the editor, compiles it and answers questions about the
written code and the compiler behaviour. With the help of the recorder the programmer
can use all conveniences of the program development environment: editor, compiler and
debugger. Through voice communication between the two speakers the programmer ‘creates’
a program in standard steps:

1. generates code,

2. runs program,

3. edits code,

4. repeats steps 2) and 3) until program runs properly.

The current record stops as soon as the program runs correctly.

2.1 Tasks 

There is a large set of common simple tasks usually used for teaching students
programming languages and techniques. The following tasks were chosen in the first
stage of corpus creation:

- Hello world - the program simply writes the string ‘Hello world’ to the output

- Circle volume - the program computes circle volume

- Iterative factorial - the program computes the factorial using an iterative algorithm

- Recursive factorial - the program computes the factorial using a recursive algorithm

- Selection sort - the program sorts data using the selection sort algorithm

- Insertion sort - the program sorts data using the insertion sort algorithm

- Substring search - the program searches for a substring in a string; only a simple
algorithm based on comparison of all characters is used

The programmer has to solve the basic algorithm as well as program input and output.
Managing input and output is always a problem in the process of creating a program, even
if they are simple (for example, an array of real numbers). Thus, we can investigate
complex negotiation between the programmer and the recorder in our corpus. The longest
record is 58 minutes long (‘SelectSort’ in one case).

2.2 Dialogue Classification 

In order to cover many real situations we have considered several types of dialogues.
First, we distinguish two cases. In the first case the programmer simply reads out, in
the most efficient way, program code prepared in advance. In the second case he/she has
theoretical knowledge of the problem and creates the program code offhand. There are
three groups of programmers in our corpus:

- experienced programmers

- novice programmers

- non-programmers (who can only read out the pre-printed program code)

3 Dialogue Manager

Many research projects (for example SUNDIAL, see [6]) have resulted in the design and
development of a flexible, domain- and language-independent dialogue manager. We call
it the ‘standard’ dialogue manager. The new dialogue manager is being developed using a
generic dialogue manager software package of the SUNDIAL type.

3.1 Standard Dialogue Manager 

The dialogue manager monitors the entire course of the dialogue. It consists of four
modules: the linguistic interface, the belief module, the task module and the dialogue
module. The dialogue module controls the dialogue process: it obtains a semantic
interpretation of the user's utterance from the linguistic interface and passes a
semantic representation of the system message back to the linguistic interface. The task
module is responsible for accessing an application database. The belief module provides
a repository of semantic knowledge based on a ‘belief model’, which is concerned with
modelling the world [8]. Specific knowledge bases can be associated with each module,
which aids customization.

The standard dialogue manager is language and task independent: the same software can
handle any language with any task, provided that the parser for the respective language
builds the required parsed structure. The system is configured for the desired
combination of language and task at initialization time. A set of software flags
determines, for example, which task knowledge base will be loaded and which application
system interface and which parser for which language will be used. The dialogue manager
itself is thus a generic system, and any running configuration is one instance of that
generic system.
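The idea of a running configuration as one instance of a generic system can be sketched in a few lines of Python. The flag names and the registry contents are hypothetical and are not taken from the SUNDIAL software; the sketch only shows flags being resolved to concrete components at initialization time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Configuration:
    """Initialization-time flags (hypothetical names)."""
    language: str
    task: str

# Hypothetical registries of available components.
PARSERS = {"czech": "czech_parser", "english": "english_parser"}
TASK_KNOWLEDGE_BASES = {"programming": "programming_kb", "timetable": "timetable_kb"}

def instantiate(config):
    """Resolve the flags to components; unknown values fail early."""
    if config.language not in PARSERS:
        raise ValueError(f"no parser for language {config.language!r}")
    if config.task not in TASK_KNOWLEDGE_BASES:
        raise ValueError(f"no knowledge base for task {config.task!r}")
    return {
        "parser": PARSERS[config.language],
        "task_kb": TASK_KNOWLEDGE_BASES[config.task],
    }
```

The generic code stays unchanged; only the `Configuration` value differs between, for example, a Czech programming assistant and an English timetable service.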

Because of the typical modular structure of dialogue systems and the uniform SIL
interface (see section 4), developers can exchange modules inside the dialogue manager
or add or exchange external components (e.g. a generation module starting from semantic
descriptions instead of a template generator) after testing these components by means
of ‘test files’, without having to deal with the internals of other modules. Thus, the
dialogue manager can easily be adapted to new technologies while the main body of the
system remains fully operational. It is therefore possible to adapt the standard dialogue
manager to a dialogue system for visually impaired programmers.

To derive the specific features of the dialogue manager, it is first necessary to
describe the specific features of a dialogue with a visually impaired programmer.

3.2 Specific Features of the Dialogue 

In dyadic dialogue communication between humans, conversation is characterized by
turn-taking: in general, one participant, A, talks and stops; another, B, talks and
stops; and so we obtain an A-B-A-B-A-B distribution of talk across two participants
[3]. Human speakers alternate in an orderly way. For instance, there is less than 4%
speaking overlap in our corpus.

The development of a dialogue system for visually impaired programmers is a specific
application area of human-computer interaction technologies. A user gives commands
concerning the generation, editing and debugging of a program and expects the system
to be able to read the written source code aloud and to describe the compiler behaviour.
Thus, the following important dialogue features specific to the proposed system should
be mentioned:

- limited subset of natural language 

- specific kind of conversation (synthetic approach) 

- special (programming language dependent) dialogue scenario 

- use of rather large set of keywords 

- step-by-step development of the program source code 

- rich set of presupposed common and general production rules (common knowledge) 

- the user assumes, or requires a lot of common (general) facts which have to be stored 
in the system database 

- advanced speech input verification and acknowledgement 

- use of many predefined words (e. g. reserved words) 

- many extensions (support) for the typed input (if required) 

- frequent use of isolated spoken or spelled words (names) 

- frequently repeated source code reading 

- frequent searching in the completed part of the source code 

- frequent returns to the completed part of the source code and its corrections 

- strong definition of the dialogue subset for the source code editing 

- possible generation (pre-generation) of some parts of the program source code 

- permanent checking of the created program source code 

- advanced error detection and advisement (sound warnings) 



3.3 Specific Features of the Dialogue Manager 

The specific features of the dialogue manager follow from the dialogue features specific
to communication between a visually impaired programmer and a dialogue system. The main
features are the following:

- special processing of keyword input 

- processing of the typed input (if required) 

- rather large storage area for predefined symbols, keywords, reserved words etc. 

- processing of spelled names and program constructions 

- storage of a large common knowledge base 

- dynamic processing of occurring facts 

- storage of the complete dialogue history path 

- frequent searching in the dialogue history path

- special organization of the dialogue history path (the classical stack organization 
seems to be insufficient) 

- use of quite special internal utterance representation (modified SIL, see section 4) 

- step-by-step internal semantic processing 

- very specific task (application) module preserving the communication with the cre- 
ated program source code, including editing 
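Why a classical stack is insufficient for the dialogue history path can be shown with a small Python sketch. It is only an illustration under assumed names (`DialogueHistory`, `last_about_line`, `search`), not the planned data structure: the point is that earlier turns must remain reachable by content, whereas a stack exposes only its top element.

```python
class DialogueHistory:
    """Append-only dialogue history that supports searching."""

    def __init__(self):
        self._turns = []

    def add(self, speaker, utterance, line=None):
        """Store a turn; line is the source line it refers to, if any."""
        self._turns.append({
            "index": len(self._turns),
            "speaker": speaker,
            "utterance": utterance,
            "line": line,
        })

    def last_about_line(self, line):
        """Return the most recent turn that referred to the given source line."""
        for turn in reversed(self._turns):
            if turn["line"] == line:
                return turn
        return None

    def search(self, keyword):
        """Return all turns whose utterance contains the keyword."""
        needle = keyword.lower()
        return [t for t in self._turns if needle in t["utterance"].lower()]
```

A return to an already completed part of the source code then becomes a lookup in the history rather than a sequence of pops that would discard the intermediate turns.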

There are two ways to specify an identifier used in the source code: the visually
impaired programmer either spells the identifier or uses a specially adapted keyboard.
The system is supposed to use a speech recognition module for receiving the user's
commands and questions, and a module for sound generation.

Computer-generated sounds can be divided into two classes: synthesized speech and
non-speech audio. Current systems for computer access for blind users are dominated
by synthesized speech, but support for non-speech audio is also very useful. For
example, it is better to use an earcon with strictly defined semantics instead of a
long spoken system message, because a user may not keep listening attentively.

Ideally, the dialogue manager ought to respond to the user's input ‘immediately’.
Because processing takes a finite time, the only way to achieve this speed of response
in principle is to process the input incrementally while the user is speaking. In
practice, the typical pipeline architecture of the system limits incremental processing,
because the dialogue manager obtains its input only after the user has finished speaking,
the front end has recognized the end of the user's utterance and the linguistic processor
has done its job. Only then can the dialogue manager begin processing. Within the
pipeline architecture the dialogue manager could be made to react more quickly by running
it on fast hardware. Alternatively, more efficient search algorithms could be used in
the belief module.

Based on the study of the dialogue corpus we looked for possible rules for the reactions
of the dialogue manager; see the examples in Table 1.



Table 1. Example of system response rules

User's utterance                     System reaction
-----------------------------------  --------------------------------------------------
Open a new function                  Do you wish to open a function with an output
                                     value? If yes, specify a type of the output
                                     parameter!

I'd like to correct the eighth line  - searching for the beginning (head) of the
                                       current function
                                     - moving to the required line

Correct the function parameter       - searching for the function head
                                     - searching for the parameter list
                                     - generating the corresponding response

Close the parameter list             generating the right parenthesis and asking for
                                     the next action, e.g. May I generate the left
                                     brace?
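The response rules of Table 1 can be written down as an ordered list of patterns and action sequences. The following Python sketch is only an illustration: matching by regular expressions, the wording of the actions and the fallback reaction are assumptions, not part of the described dialogue manager.

```python
import re

# Ordered (pattern, actions) pairs modelled on Table 1.
RESPONSE_RULES = [
    (re.compile(r"open a new function"),
     ["ask: Do you wish to open a function with an output value?"]),
    (re.compile(r"correct the \w+ line"),
     ["search for the head of the current function",
      "move to the required line"]),
    (re.compile(r"correct the function parameter"),
     ["search for the function head",
      "search for the parameter list",
      "generate the corresponding response"]),
    (re.compile(r"close the parameter list"),
     ["generate the right parenthesis",
      "ask: May I generate the left brace?"]),
]

def react(utterance):
    """Return the action sequence of the first rule that matches."""
    text = utterance.lower()
    for pattern, actions in RESPONSE_RULES:
        if pattern.search(text):
            return actions
    return ["ask the user to rephrase"]  # hypothetical fallback
```

For example, `react("I'd like to correct the eighth line")` returns the two navigation steps of the second table row.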







4 Semantic Representation 

Communication among the system modules and between the dialogue manager and the
linguistic processor is realized using the Semantic Interface Language (SIL). Requiring
all interfaces to be defined in terms of this language aids the transparency of the code
and reduces the amount of intermediate processing which would otherwise be required
to translate from one protocol to another.

One of the major features of SIL is that the semantic descriptions generated by a parser
may consist of arbitrarily many sub-descriptions. Thus, the parser does not necessarily
have to find a semantic description spanning the whole utterance; it can send the
semantics of any sub-piece that it has found to the dialogue manager, which then has to
‘make sense of it’. This adds considerably to the robustness of the system: corrupted
speech input, with non-words like ‘ehm’ or excessive noise in single places, and
utterances consisting of more than one sentence can still be parsed to a certain degree,
so that some interpretation and an appropriate system reaction will take place
nevertheless.

SIL provides a simple semantic representation of utterances and could also be used
for knowledge representation. At the linguistically oriented level, the representation
of an utterance in SIL consists of UFOs (Utterance Field Objects); each UFO represents
a part of the utterance [8]. An example of the UFO representation of the utterance
‘Replace the function parameter K1’ is shown in Figure 1.


[ syntax:    [ string: replace the function parameter K1 ]

  semantics: [ id:   makeS1
               type: replace

               theagent:  [ id:    none
                            type:  individual
                            value: speaker ]

               theobject: [ id:   func1
                            type: function

                            thename: [ id:    fname1
                                       type:  functionname
                                       value: unknown ]

                            thegoal: [ id:    par1
                                       type:  parameter
                                       value: K1 ] ] ] ]

Fig. 1. SIL representation of the utterance “Replace the function parameter K1”
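A structure like the one in Fig. 1 maps naturally onto nested dictionaries. The Python sketch below is a reading of the figure, not SIL itself: the nesting of `thename` and `thegoal` inside `theobject` and the spelling of the identifiers are assumptions based on the printed layout. The helper shows how a dialogue manager could pick out sub-descriptions by type without needing a description that spans the whole utterance.

```python
# Assumed transcription of Fig. 1 as nested dictionaries.
UFO = {
    "syntax": {"string": "replace the function parameter K1"},
    "semantics": {
        "id": "makeS1",
        "type": "replace",
        "theagent": {"id": "none", "type": "individual", "value": "speaker"},
        "theobject": {
            "id": "func1",
            "type": "function",
            "thename": {"id": "fname1", "type": "functionname", "value": "unknown"},
            "thegoal": {"id": "par1", "type": "parameter", "value": "K1"},
        },
    },
}

def find_by_type(description, wanted):
    """Collect all sub-descriptions of the given type, at any depth."""
    found = []
    if isinstance(description, dict):
        if description.get("type") == wanted:
            found.append(description)
        for value in description.values():
            found.extend(find_by_type(value, wanted))
    return found
```

With this helper, `find_by_type(UFO, "parameter")` returns the sub-description whose value is `K1`, regardless of where it sits in the structure.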



5 A Short Example of the Dialogue

Example 1 shows a fragment of a possible dialogue between the user and the system. Let
us suppose that in previous utterances (not shown here) the user asked to correct the
function ‘Scanner::getChar’. At the beginning of the presented fragment the system
confirms the user's intention and starts to negotiate with him/her.

S: Do you want to correct the function void - upper-case-s
- lower-case-c - a - n - n - e - r - colon - colon -
lower-case-g - e - t - upper-case-c - lower-case-h - a
- r - left parenthesis - void - right parenthesis ?

P: Yes, the sixth line.

S: The sixth line of the corrected function comprises the
statement: if - left parenthesis - lower-case-i - n -
upper-case-f - lower-case-e - o - f - left parenthesis
- right parenthesis - right parenthesis.

P: OK, put in upper-case-f as the parameter of the
function !

S: Completed, the sixth line comprises now : if - left
parenthesis - lower-case-i - n - upper-case-f -
lower-case-e - o - f - left parenthesis - upper-case-f -
right parenthesis - right parenthesis. Are you satisfied
with the correction ?

P: Yes, further I want . . .

Example 1. A fragment of a dialogue

6 Conclusion and Future Work

The development of a dialogue system for visually impaired programmers is a complex
task in the field of human-computer interaction technologies that help visually impaired
people to create, edit and debug programs. We have analysed dialogues recorded during
the simulation of a dialogue between a visually impaired person and the system; the
dialogues were transcribed using the GAT [9] transcription scheme. Specific requirements
for the adaptation of the standard dialogue manager to a dialogue system for visually
impaired programmers were formulated. The development will continue with the following
steps:

- extension of the corpus using more tasks

- extension of the rules of the dialogue manager knowledge base to cover as wide a range
of dialogue states as possible

- implementation of a dialogue manager that is based on and respects the tree structures
of the created program modules and source code

We assume that a voice-operated multilingual intelligent system will be used not only
by visually impaired people, but also by other disabled and by non-disabled people in
general.
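The way identifiers are read back in Example 1 follows a recognizable convention: letters are spelled one by one, and the case is announced for the first letter, after every non-letter symbol and whenever the case changes. A minimal Python sketch of this convention, inferred from the example only, is given below; the function name and the symbol table are assumptions.

```python
# Spoken names of non-letter characters (those that occur in Example 1).
SYMBOL_NAMES = {
    ":": "colon",
    "(": "left parenthesis",
    ")": "right parenthesis",
}

def spell_identifier(name):
    """Spell an identifier letter by letter, announcing each case change."""
    parts = []
    last_case = None  # None forces an announcement for the next letter
    for ch in name:
        if ch.isalpha():
            case = "upper" if ch.isupper() else "lower"
            if case != last_case:
                parts.append(f"{case}-case-{ch.lower()}")
            else:
                parts.append(ch.lower())
            last_case = case
        else:
            parts.append(SYMBOL_NAMES.get(ch, ch))
            last_case = None  # announce the case again after a symbol
    return " - ".join(parts)
```

Applied to `Scanner::getChar`, the function produces the sequence spoken by the system in its first turn, and applied to `inFeof` it produces the spelling used for the sixth line.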
Abstract. The paper deals with the help system as a part of a dialogue grammar
based development environment. This programming environment is intended especially
for visually impaired programmers. The generation of help dialogues according to the
language grammar is described. The different levels of the help system are discussed
in detail and specific examples for the C/C++ language are presented. The final part
of the paper is devoted to possible improvements of the current help system.



1 Introduction 

This paper deals with a special application of dialogue systems: dialogue-based
programming. The first idea of this kind of dialogue system was introduced in [2].
The design of such a system is also described in [5]. A new special language intended
especially for visually impaired programmers and a spoken dialogue system based on
this language were also introduced in [2].

A new version of this kind of system, named AudiC, is a dialogue grammar based
system that provides a uniform dialogue-programming environment for visually impaired
programmers. The AudiC system is being developed in co-operation between the Faculty
of Informatics, Masaryk University, Brno, CZ and the Department of Computer Science
and Engineering, University of West Bohemia, Pilsen, CZ. The system environment is
intended to be adaptable to widespread programming languages. No visual feedback is
required: the whole system is speech oriented and fully controlled by speech and
keyboard commands. The main goals of the system are the following:

- to provide a programmer with a user-friendly programming environment 

- to provide a user with the effective code generation that ensures minimal amount of 
syntactical errors 

- to allow a user to read, edit and debug the source code most easily 

- to familiarize a novice programmer with supported programming languages 

- to provide a user with a complex and easy-to-use help system 

The help dialogue system is one of the integrated components of the AudiC system. The
last two goals mentioned above in particular have to be taken into consideration in
the design of the help system.

The currently implemented version of the AudiC system enables visually impaired
people to write programs in the C/C++ language. Consequently, the help system discussed
in the following parts, as a special part of the AudiC dialogue application, describes
the C/C++ language.

2 Help System 

2.1 Basic Approaches 

The AudiC help system, as an integrated part of the AudiC dialogue system, has to
combine two different approaches. On the one hand, the help system has to be able
to explain fundamental problems clearly to novice programmers. On the other hand, the
help system must not hinder the experienced programmer with redundant information.
Picking out or skipping a piece of information visually in written text is much more
convenient than choosing the appropriate spoken information. We also have to take into
account that every user knows only a certain part of the C/C++ language. The differences
are especially large in the knowledge of function libraries. The solution that reconciles
these opposing requirements is introduced in section 2.2.

2.2 Help System Structure 

In order to enable a user to choose appropriate help information we introduced four
levels of the help system. The system of levels has the following properties:

- The levels are mutually independent.

- The levels are accessed via different special keys at any time.

- The levels provide different types of information.

- It is possible to switch among the levels at any time.

- The components of the help system are connected with hyperlinks at the same level.

2.3 Help System Implementation 

The AudiC help system, like the whole AudiC system, is written in VoiceXML (Voice
Extensible Markup Language). VoiceXML is an open, broadly supported language designed
to make Internet content and information accessible via voice and phone (see [6]).
The help system is divided into special parts called help dialogues. Each dialogue
describes a terminal or nonterminal symbol of the C/C++ grammar or a special term of the
C/C++ language at each level of the help system. All the terminal symbols, nonterminal
symbols, special symbols and the relations between them are stored in a database. This
database is used by a special program that generates all the help dialogues (see
section 2.6) at each level. The transfer of control between dialogues is implemented
with VoiceXML references.
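How help dialogues might be generated from such a database can be sketched in Python. This is not the AudiC generator: the database layout, the help texts and the naming scheme `symbol_level` for form identifiers are assumptions, and the sketch emits lowercase VoiceXML 2.0 element names (`form`, `block`, `prompt`, `field`, `option`, `enumerate`).

```python
import xml.etree.ElementTree as ET

# Hypothetical symbol database: symbol -> (help text, related symbols).
SYMBOL_DB = {
    "case": ("Text of the help for the case label.", ["switch", "break"]),
    "switch": ("Text of the help for the switch statement.", ["case"]),
    "break": ("Text of the help for the break statement.", []),
}

def build_help_form(symbol, db, level=1):
    """Build one help dialogue: main help text plus a menu of related items."""
    text, related = db[symbol]
    form = ET.Element("form", id=f"{symbol}_{level}")
    block = ET.SubElement(form, "block", name="main_help")
    ET.SubElement(block, "prompt").text = text
    if related:
        field = ET.SubElement(form, "field", name="next_help")
        prompt = ET.SubElement(field, "prompt")
        prompt.text = "Please select the item from following menu"
        ET.SubElement(prompt, "enumerate")
        for key, target in enumerate(related, start=1):
            # Each menu item refers to the help dialogue of a related symbol.
            option = ET.SubElement(field, "option", dtmf=str(key),
                                   value=f"#{target}_{level}")
            option.text = target
    return form

def generate_all(db, level=1):
    """Generate the help dialogues of one level for every symbol."""
    return {symbol: ET.tostring(build_help_form(symbol, db, level), encoding="unicode")
            for symbol in db}
```

Keeping the relations in the database means that adding a symbol or a link only changes data; all dialogues of all levels can then be regenerated.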
2.4 Help Context

The source code generated by a programmer is represented as a tree structure. A node
of the tree together with the name of the corresponding help dialogue forms a structure
called the help context. A user can switch to the help context at any time while
generating, reading or editing the source code. The old context is then suspended and
the help context is loaded into the VoiceXML interpreter at the user's request.

2.5 C/C++ Grammar 

The C grammar is a context-free grammar in which the definition stands on the right-hand
side of each syntactical rule. This definition consists of three basic symbol types:
terminal symbols, nonterminal symbols and optional nonterminal symbols. The definitions
on the right-hand side can also form a list. However, the C/C++ language has a large
number of syntactical rules. For this reason we decided to analyze a sample of source
files in the C/C++ language and to evaluate the usage frequency of syntactical
constructions. We analyzed 859 source files written in the C language and compiled
statistics on usage frequency.
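A usage-frequency statistic of this kind can be approximated with a very small Python script. The sketch below is deliberately crude and is not the analysis used for the 859 files: it counts a hand-picked list of keywords with a regular expression instead of parsing the sources, so occurrences inside comments and string literals are counted as well.

```python
import re
from collections import Counter

# Keywords standing in for syntactical constructions (assumed selection).
KEYWORDS = ("if", "else", "for", "while", "do", "switch", "case", "return")
KEYWORD_PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")

def construct_frequencies(sources):
    """Count keyword occurrences over an iterable of source texts."""
    counts = Counter()
    for text in sources:
        counts.update(KEYWORD_PATTERN.findall(text))
    return counts
```

The resulting counter can be sorted with `most_common()` to decide which constructions deserve the most detailed help dialogues.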

2.6 Help Dialogue 

Each terminal or nonterminal symbol of the C/C++ grammar is described with own help 
dialogue. All dialogues at each level of the help system share the same dialogue structure 
that represents the basic dialogue strategy (see Figure 1). The basic dialogue strategy is 
following: 

- The main help section containing information about corresponding terminal or non- 
terminal symbol or a special term is read through. 

- The next help section is read through, the user selects from the menu of help items. 
The menu items refer to related help dialogues. 

- If the user selects an item from the menu of help items, the dialogue transfers control 
to the selected help dialogue. 

- If the user does not select an item from the menu of help items, the subdialogue is 
read through, then the next help section is read over. 

- The help subdialogue informs the user about the possibility to leave the help context 
or to switch among the levels of the help system. 

- The help dialogue can be interrupted at any time, the user returns to the previous 
context. 

Important notes on the source code (see Figure 1):

- The main help section and the next help section are separated by special 'next help
sounds'.

- The individual help menu items are separated by special 'select sounds'.

- Comments are written as <!-- -->.

<FORM id="form_name">

  <!-- main help section -->
  <BLOCK NAME="main_help">
    <PROMPT>
      [Text of the main help]
      <!-- the main help is read through -->
    </PROMPT>
  </BLOCK>

  <!-- next help section -->
  <FIELD NAME="next_help">
    <PROMPT>
      <!-- separation of the next help section -->
      <AUDIO SRC="next_help.wav"/>
      <!-- the following text is read through -->
      Please select an item from the following menu
      <ENUMERATE/>
    </PROMPT>

    <!-- help menu items -->
    <OPTION DTMF="1" VALUE="#first_item">
      <AUDIO SRC="select.wav"/> first item
    </OPTION>
    <OPTION DTMF="2" VALUE="#second_item">
      <AUDIO SRC="select.wav"/> second item
    </OPTION>
    <OPTION DTMF="3"
    ...

    <!-- help subdialog -->
    <NOINPUT>
      <SUBDIALOG NAME="help_about_help"
                 SRC="help_main.vxml#help_about_help1"/>
      <REPROMPT/>
    </NOINPUT>
  </FIELD>

  <!-- transfer of control to next help section -->
  <BLOCK name="go_to_next_help">
    <GOTO NEXT="next_help"/>
  </BLOCK>

</FORM>



Fig. 1. The basic dialogue structure 
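
The control flow of this dialogue structure can be traced with a small simulation. The
Python sketch below is only an illustration under stated assumptions: the function
run_help_dialogue, the dictionary layout, and the transcript labels are hypothetical, and
no speech input or output is performed.

```python
def run_help_dialogue(dialogues, start, user_inputs):
    """Walk through help dialogues and return a transcript of what is played.

    dialogues:   maps a dialogue id to {"menu": {dtmf_key: target_id}}
    user_inputs: one entry per menu presentation; a DTMF key selects an item,
                 None means no input, "exit" interrupts the help dialogue.
    """
    current = start
    transcript = [("main_help", current)]            # main help section
    for user_input in user_inputs:
        transcript.append(("next_help", current))    # menu of help items
        if user_input == "exit":
            transcript.append(("return_to_previous_context", current))
            break
        target = dialogues[current]["menu"].get(user_input)
        if target is None:
            # No selection: play the help subdialogue; the menu is then re-read.
            transcript.append(("help_about_help", current))
        else:
            # Selection: control passes to the selected help dialogue.
            current = target
            transcript.append(("main_help", current))
    return transcript
```

With a menu that maps key 2 to another dialogue, the inputs None, "2", "exit" produce the
help subdialogue once, then a transfer to the selected dialogue, then a return to the
previous context.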

2.7 Levels of the Help System

Level 1

Information content

- Main help section

• The corresponding terminal (see Figure 2) or nonterminal symbol is described.

• Common terms not included in the language grammar are described.

- Next help section

• Terminal symbols, nonterminal symbols, and common terms used in the explanation
of the basic term are referenced.

Target users

- visually impaired novice programmers who want to familiarize themselves with the
basic terms of the programming language

Remarks

- Many descriptions are language independent, so they can be adapted for the help
systems of other programming languages.

- The description of special terms is independent of the syntactical rules.

<FORM id="CASE1">
  <!-- main help section -->
  The keyword case, as part of a case statement, is used in
  conjunction with switch to determine which statement is executed.

  <!-- next help section -->
  <OPTION DTMF="1" VALUE="help_terms1.vxml#keyword1">
    <AUDIO SRC="select.wav"/> keyword </OPTION>
  <OPTION DTMF="2" VALUE="help_terms1.vxml#SWITCH1">
    <AUDIO SRC="select.wav"/> switch </OPTION>

Fig. 2. The help dialogue for the terminal symbol 'CASE' - Level 1



Level 2 

Information content

- Main help section

• A terminal symbol or a common term is described - the user is told that the query
concerned a terminal symbol or a term not included in the grammar.

• A nonterminal symbol is described (see Figure 3) - information about the set of
corresponding syntactical rules is provided.

- Next help section

• A terminal symbol or a common term not included in the grammar is described -
no information is available.

• A nonterminal symbol is described - all terminal and nonterminal symbols, or lists
of symbols (see Figure 3), that appear on the right-hand side of all admissible
syntactical rules are referenced.

Target users

- both novice and experienced visually impaired programmers

Remarks

- This level of the help system is strictly based on the language grammar.

<FORM id="labeled-statement2">
  <!-- main help section -->
  The labeled-statement syntax
  <AUDIO SRC="bing.wav"/> identifier : statement
  <AUDIO SRC="bing.wav"/> CASE constant expression : statement
  <AUDIO SRC="bing.wav"/> DEFAULT : Statement

  <!-- next help section -->
  <OPTION DTMF="1" VALUE="#identifier_:_statement2">
    <AUDIO SRC="ping.wav"/> identifier : statement
  </OPTION>
  <OPTION DTMF="2" VALUE="#CASE_constant_expression_:_statement2">
    <AUDIO SRC="ping.wav"/> CASE constant expression : statement
  </OPTION>
  <OPTION DTMF="3" VALUE="#DEFAULT_:_Statement2">
    <AUDIO SRC="ping.wav"/> DEFAULT : Statement
  </OPTION>

Fig. 3. The help dialogue for 'labeled statement' - Level 2



Level 3 

Information content

- Main help section

• A terminal or nonterminal symbol is described - detailed syntactical information,
typical usage, and remarks on conventions are provided.

• Common terms not included in the grammar are described - a detailed description of
usage and remarks on conventions are provided.

- Next help section

• Terminal symbols, nonterminal symbols, and common terms that are used in
the explanation of the basic term are referenced.

Target users

- both novice and experienced visually impaired programmers

Remarks

- There is no single solution when a list of symbols appears on the right-hand side of
a syntactical rule. Sometimes it seems better to convey all the information
about the starting symbol. Alternatively, the user can be directed to the help
dialogue corresponding to a symbol from the list. The final choice depends on the
number of symbols in the list and on how much their usage differs.




Level 4

Information content

- Main help section

• A terminal symbol is described - an example in which the terminal symbol is
frequently used is provided.

• A nonterminal symbol is described - a common example is provided, or the user is
informed that no common example can be presented for this nonterminal symbol.

- Next help section

• Examples from the same problem area, or examples explaining the usage of terminal
or nonterminal symbols or functions used in the current example, are referenced.

Target users

- especially visually impaired novice programmers who want to familiarize themselves
with real examples in the programming language

Remarks

- The main help section presents two or more short examples if together they explain
the problem better.

2.8 Function Libraries 

The standard function libraries are not, by themselves, part of the language grammar.
However, for our purposes it is very useful to treat them as part of the
grammar. For that reason we extended the language grammar with the following syntactical
rule:

Program -> function library -> function

where function library is a nonterminal symbol and function is a terminal symbol.

The information content of the help system levels for the nonterminal symbol
function library then corresponds to the information content of the levels for a nonterminal
symbol whose syntactical rules contain a list of symbols on the right-hand side.

Functions, as terminal symbols, are described as follows:

Level 1

- Main help section - the basic characteristics of the function and its relation to the
function library are provided.

- Next help section - similar functions are referenced.

Level 2

- Main help section - information about the function syntax is provided.

- Next help section - the syntax of similar functions is referenced.

Level 3

- Main help section - detailed characteristics of the function, the number and types of
parameters, the return value, remarks, and conventions are provided.

- Next help section - the function parameter types are referenced.

Level 4

- Main help section - one or more function examples are provided.

- Next help section - examples of similar functions are referenced.

3 Conclusion and Future Work

The AudiC help system, as a part of the AudiC dialogue system, is an example of
dialogue-based programming. It provides a visually impaired programmer with
comprehensive information at four levels. This system of levels makes it possible to partly
separate information for novice and experienced programmers. The system of references
helps the user navigate the help system quickly and increases the probability that the
user obtains the required information within a few steps. At present, the most frequently
used parts of the language grammar are implemented. We have to wait for feedback from all
groups of programmers before we can evaluate and, if necessary, correct the division into
levels. Then we will also be able to decide which parts of the grammar are heavily used
and require special attention, and which parts are usually omitted by programmers. We
believe that the AudiC help system will be useful not only for visually impaired
programmers but also for programmers with other disabilities and for programmers
without disabilities working in C/C++. In the future we intend to build help systems
for other languages, especially for Java.

Abstract. The increasing use and availability of the Internet has led to the development
of a large number of web-based applications. However, the interfaces of many
of these applications are not suitable for people with special needs. This paper
describes how the HTML interface of a web-based application can be transformed
into a dialogue interface, which seems to be more suitable for those people. The paper
also describes the basic principle of the transformation and gives a detailed description
of some of the algorithms used during the process.



1 Introduction 

Many applications use a web-based interface. The web-based interface makes
the applications available to people all around the world. Unfortunately, these
applications are not suitable for people with special needs because of their visually
oriented interface. The advantage of a dialogue interface is that it can reduce the amount
of data that is transferred to the user in a single step. To implement the dialogue interface
we decided to use the VoiceXML [1] language.

Because designing an application interface in both HTML and VoiceXML at the
same time can be expensive, especially for legacy applications, we tried to find a more
efficient method of creating the dialogue interface. This led to the idea of creating
a client that converts an HTML document into a VoiceXML one [2] and generates the
dialogue interface this way.

2 Why Generate Dialogue Interfaces

We will discuss three ways of implementing a dialogue interface for an application.
The first is to build the dialogue interface into the application when it
is designed. The second is similar to the first: we can manually create a specialized
dialogue interface for the particular application.

Although we can get very good results when we create the dialogue interface manually,
because we can use specific features of the application, the resulting interface
is application dependent and must be created by hand for every application. This
disadvantage makes the solution practically unusable for a general dialogue interface.

The third possibility is to generate the dialogue interface on the client side. To implement
this solution we must create a client that transforms the HTML interface into
the dialogue one. This solution is application independent and can be reused by many
applications. There is no need to modify the applications, so the interface can be used with
legacy applications and many others. The disadvantage of this solution is that the resulting
interface can be less optimal than a manually created one. A further limitation is
that the visual interface must be implemented in plain HTML for the transformation
to work properly. The algorithm fails, for example, when the interface is implemented
using other technologies such as Java applets.



3 Basic Principles of the Dialogue Interface Generation 

We now describe the basic algorithm of the client that generates the universal dialogue
interface:

1. Transform the existing HTML interface into a VoiceXML one.

2. Interpret the VoiceXML document.

3. When the interpretation of the document is finished, the interpreter returns a VoiceXML
submit, i.e. the URL of the next document to be processed.

4. Check the returned URL:

(a) if it is not void, get the document at the specified URL and go to step 1;

(b) otherwise, go to the start page.
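
The loop formed by these steps can be sketched in a few lines of Python. This is a
minimal sketch, not the authors' client: fetch, html_to_vxml, and interpret are
hypothetical callables supplied by the caller, and the max_documents bound is an
assumption added so that the otherwise endless loop terminates in an example.

```python
def run_client(start_url, fetch, html_to_vxml, interpret, max_documents=10):
    """Process documents in a loop and return the list of visited URLs.

    fetch(url)          -> HTML text of the document (hypothetical)
    html_to_vxml(html)  -> VoiceXML document (step 1, hypothetical)
    interpret(vxml)     -> URL from the VoiceXML submit, or None (steps 2-3)
    """
    url = start_url
    visited = []
    for _ in range(max_documents):
        visited.append(url)
        vxml = html_to_vxml(fetch(url))   # step 1: HTML -> VoiceXML
        next_url = interpret(vxml)        # steps 2-3: run dialogue, get submit URL
        # step 4: follow the submit URL, or fall back to the start page
        url = next_url if next_url else start_url
    return visited
```

Passing the three stages in as functions keeps the loop independent of any particular
VoiceXML interpreter or HTTP library.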

Several problems in the described algorithm remain to be solved:

1. How to transform a general HTML document that contains tables, images, forms,
and other HTML objects into a VoiceXML one.

2. How to get the parameters that should be submitted to the application.

3. How to get sufficient information about each parameter from the source document.
This information is used in the dialogue to inform the user of the expected inputs.

4. How to choose a good dialogue strategy, so that the resulting dialogue obtains all
required information from the user in the smallest number of steps.

Solutions to most of these problems are described in the following sections.



3.1 Basic Principles of Generation of VoiceXML Documents 

The executive parts of the web interface of many web-based applications are the HTML
forms. The rest of the document provides information about the function of the document,
the meaning of the input fields, etc. The corresponding algorithm for translating the
document can be as follows:

1. Get all text outside the form(s) included in the document, in the order in which the
text appears, and translate it into VoiceXML.

2. For all forms:

(a) Get all inputs of the form and their descriptions.

(b) Translate each input into a corresponding VoiceXML field.
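
The first half of this algorithm, separating the text outside the forms from the form
inputs, can be sketched with Python's standard html.parser module. The class name
FormSplitter and the helper split_document are assumptions made for illustration; the
translation into VoiceXML itself is left out.

```python
from html.parser import HTMLParser

class FormSplitter(HTMLParser):
    """Collect text outside forms and the inputs of each form."""

    def __init__(self):
        super().__init__()
        self.outside_text = []   # text outside any <form>, in document order
        self.forms = []          # one dict per form: action and list of inputs
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)      # tag and attribute names arrive lowercased
        if tag == "form":
            self._in_form = True
            self.forms.append({"action": attrs.get("action", ""), "inputs": []})
        elif tag == "input" and self._in_form:
            self.forms[-1]["inputs"].append(attrs)

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._in_form:
            self.outside_text.append(text)

def split_document(html):
    """Return (text outside forms, list of forms) for an HTML document."""
    parser = FormSplitter()
    parser.feed(html)
    parser.close()
    return parser.outside_text, parser.forms
```

Text found inside a form is deliberately ignored here, because it is a candidate for the
input descriptions rather than for the introductory prompt.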

3.2 Getting Descriptions of an Input Field

The next problem mentioned above was how to get a description of each input field in the
document. The solution is based on the common structure of the web interfaces of
applications. The structure can be described in the following way:

1. Introduction, the aim of the document, ...

2. The main form of the application.

3. Some more information about the meaning and possible values of the input fields
required from the user.

4. The rest of the document.

This scheme shows what information should be known to the user before we ask
for the values to be processed. We need to read out the introduction (the 1st part
of the document) and the information in the 3rd part of the HTML document. When
we use the dialogue strategy with system initiative (see the section Dialogue Strategies for
details), we should also describe to the user the meaning and the possible values of each
input field.

To obtain the descriptions of input fields we can use one of the following approaches:

1. Complete semantic analysis of the document.

2. A heuristic algorithm based on the structure of documents.

We decided to use a heuristic, because we do not know of an algorithm that performs
message understanding on general documents. The results of the
heuristic seem to be acceptable, and the heuristic is fast.

The principle of the heuristic can be described in the following way:

1. Use the ALT attribute of each HTML input field.

2. If the form is placed in a table (for better visual design), look for the description
in the same row and then in the same column. Use this value if you find text in the
appropriate cell.

3. If you find the name of the input field in the preceding text, use the sentence containing
the name instead.

4. If you did not find an alternative description of the input in the previous steps, use all
the text between the previous input and this one.

5. If you did not find any description in the previous steps, use the name of
the input.
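
The heuristic can be read as a chain of fallbacks. The Python sketch below follows that
reading, which is one possible interpretation of the steps rather than the authors'
implementation. It assumes that the candidate texts (ALT value, table cells, surrounding
text) have already been extracted from the document; the function name describe_input and
its parameters are hypothetical.

```python
import re

def describe_input(name, alt=None, row_text=None, column_text=None,
                   preceding_text="", text_since_previous_input=""):
    """Pick a spoken description for an input field, trying the steps in order."""
    # Step 1: the ALT attribute of the input.
    if alt and alt.strip():
        return alt.strip()
    # Step 2: a table cell in the same row, then in the same column.
    for cell in (row_text, column_text):
        if cell and cell.strip():
            return cell.strip()
    # Step 3: a sentence of the preceding text that mentions the input's name.
    for sentence in re.split(r"(?<=[.!?])\s+", preceding_text):
        if re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE):
            return sentence.strip()
    # Step 4: all text between the previous input and this one.
    if text_since_previous_input.strip():
        return text_since_previous_input.strip()
    # Step 5: the name of the input itself.
    return name
```

Because every step only returns when it finds non-empty text, the function always yields
some description, in the worst case the bare input name.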

As described above, this algorithm is used only in connection with the
system initiative dialogue strategy (see the next section for details).



4 Dialogue Strategies 

The dialogue interface implements two basic dialogue strategies for acquiring information
from the user. The first strategy is based on a direct 1:1 translation of HTML inputs to
VoiceXML input fields. Such a VoiceXML document implements the system initiative
dialogue strategy. A short example follows. Consider the following HTML form:

<FORM action="some.php">
  Enter your name: <INPUT type=text name=name><BR>
  Enter your surname: <INPUT type=text name=sname><BR>
  Your sex: <INPUT type=radio name=sex value="Male">Male
  <INPUT type=radio name=sex value="Female">Female<BR>
  <INPUT type=submit value="Proceed">
</FORM>

This form will be translated into the following VoiceXML (every HTML input is
translated into the VoiceXML field with the same name):

<form id="form1">
  <field name="name">
    <prompt>
      Enter your name:
    </prompt>
  </field>
  <field name="sname">
    <prompt>
      Enter your surname:
    </prompt>
  </field>
  <field name="sex">
    <prompt>
      Your sex. Possible values are male, female:
    </prompt>
    <grammar type="application/x-jsgf">
      male | female
    </grammar>
  </field>
  <field name="submit">
    <prompt>
      Proceed. Yes or No
    </prompt>
    <grammar type="application/x-jsgf">
      yes | no
    </grammar>
    <filled>
      <if cond="submit=='yes'">
        <submit next="some.php" namelist="name sname sex"/>
      </if>
      <clear namelist="name sname sex submit"/>
    </filled>
  </field>
</form>

The exact translation of HTML inputs will be described later.

The second dialogue strategy is very close to the dialogue strategy used in
real life. The strategy is as follows:

1. Read all useful information to the user.

2. Let the user answer in a full sentence.

3. Try to retrieve all possible information included in the user’s answer.

4. Ask the user for the required information not included in his or her answer (it is
a good idea to use the system initiative dialogue strategy for this step, because
we can ask the user exactly for the missing information).

The example VoiceXML document for this strategy follows:

<form id="f1">
  <field name="values">
    <prompt>
      Enter your name, surname and sex
    </prompt>
    <filled>
      <prompt>
        Thank you for information. Your request is being processed.
      </prompt>
      <submit next="some.php" namelist="values"/>
    </filled>
  </field>
</form>

The dialogue realized by this VoiceXML form looks like the following conversation:

System: Enter your name, surname and sex.

User: My name is Jan Novak and I’m a male.

System: Thank you for information. Your request is being processed.

The dialogue then continues with the HTML document generated by some.php (the
value of the attribute next in the tag submit).

Steps 1 and 2 of the discussed dialogue strategy are included in the VoiceXML document.
Step 3 is realized by the client, and step 4 is realized using a further
VoiceXML document when necessary.

To perform step 3 we use a set of template answers. The principle of the algorithm
is as follows:

1. Try to find a template matching the user’s answer.

2. If you find one, try to extract the required values (see the following example).

3. If the previous step fails, inform the user that you did not understand the answer
and use the system initiative dialogue strategy to retrieve the required values.

A short example follows:

System: Enter the name of the author or (and) the name of the book.
Enter the type of the search (exact/substring) as well.

User: I’d like Raven from Edgar Alan Poe please.

System: Enter the type of the search. Possible values are exact or substring.

User: Exact.

The system finds the applicable template, which may look like, for example, “I’d like <title>
from <author> [rest].” The system assigns the value Edgar Alan Poe as the name of the author
and Raven as the title of the book, and asks the user for the missing parameters (the search
type in this example).
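
Template matching of this kind can be sketched with regular expressions. In the Python
sketch below, which is an illustration and not the authors' matcher, each template is a
hand-written pattern with named groups; treating the trailing [rest] as an optional
"please" and the names TEMPLATES and extract_values are assumptions.

```python
import re

# One hand-written template; named groups mark the values to extract.
TEMPLATES = [
    re.compile(
        r"i'd like (?P<title>.+?) from (?P<author>.+?)(?: please)?[.!]?$",
        re.IGNORECASE,
    ),
]

def extract_values(answer, required, templates=TEMPLATES):
    """Return (found values, names of required values still missing)."""
    # Normalize the typographic apostrophe so that "I’d" matches "i'd".
    text = answer.strip().replace("\u2019", "'")
    for template in templates:
        match = template.match(text)
        if match:
            found = {k: v for k, v in match.groupdict().items() if v}
            missing = [name for name in required if name not in found]
            return found, missing
    # No template matched: everything must be asked for explicitly.
    return {}, list(required)
```

The list of missing names is what the client would pass on to the system initiative
strategy, which then asks for exactly those values.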

The database of template answers will be built using the “Wizard of Oz” method [4],
which is used to simulate human-machine dialogue.

5 Translation of the HTML Form Inputs

The system is based on the translation of HTML inputs, as described in the previous
sections. The translation of HTML inputs affects the quality of the output dialogue.
This section describes the translation algorithm chosen for the first implementation of
the system. We describe here only the translation algorithm used in the system initiative
dialogue strategy. The translation for the second kind of dialogue
strategy is outlined by the example in the previous section.

The rules for translation are simple and can be described by the following scheme:

<INPUT type=type name=name> ->

<field name=name>
  <prompt>
    description [+ possible values (used for inputs like radio,
    check box, select)]
  </prompt>
  [<grammar>
    possible values
  </grammar>]
  [<filled>
    rules for translation of the user answer using similar words
  </filled>]
</field>

The pair of inputs submit-reset is translated in the following way:

<INPUT type=submit value=value> ->

<field name="value">
  <prompt>
    Search (yes/no)
  </prompt>
  <grammar>
    yes | no
  </grammar>
  <filled>
    <if cond="value=='yes'">
      <submit next="form action" namelist="list of form's inputs"/>
    </if>
    <clear namelist="list of form's inputs"/>
  </filled>
</field>

The exact translation depends on the type of the input, as the scheme shows.
A short example of the translation was given in the section Dialogue Strategies.
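
The two schemes can be combined into a small generator. The following Python sketch uses
the standard xml.etree.ElementTree module and is only an approximation of the rules: the
function build_vxml_form and the dictionary layout of the inputs are assumptions, the
confirmation field is named submit with the prompt taken from the example in the section
Dialogue Strategies, and the optional rules for similar words are omitted.

```python
import xml.etree.ElementTree as ET

CHOICE_TYPES = {"radio", "checkbox", "select"}

def build_vxml_form(action, inputs, form_id="form1"):
    """Build a system initiative VoiceXML form from described HTML inputs.

    inputs: list of dicts with "type", "name", "description" and, for
            choice inputs, "values" (hypothetical layout).
    """
    form = ET.Element("form", {"id": form_id})
    names = []
    for item in inputs:
        names.append(item["name"])
        field = ET.SubElement(form, "field", {"name": item["name"]})
        prompt = ET.SubElement(field, "prompt")
        values = [value.lower() for value in item.get("values", [])]
        if item["type"] in CHOICE_TYPES and values:
            listed = ", ".join(values)
            prompt.text = item["description"] + ". Possible values are " + listed + ":"
            grammar = ET.SubElement(field, "grammar", {"type": "application/x-jsgf"})
            grammar.text = " | ".join(values)
        else:
            prompt.text = item["description"]
    # The submit input becomes a yes/no confirmation field.
    confirm = ET.SubElement(form, "field", {"name": "submit"})
    ET.SubElement(confirm, "prompt").text = "Proceed. Yes or No"
    ET.SubElement(confirm, "grammar", {"type": "application/x-jsgf"}).text = "yes | no"
    filled = ET.SubElement(confirm, "filled")
    cond = ET.SubElement(filled, "if", {"cond": "submit=='yes'"})
    ET.SubElement(cond, "submit", {"next": action, "namelist": " ".join(names)})
    ET.SubElement(filled, "clear", {"namelist": " ".join(names + ["submit"])})
    return form
```

Calling ET.tostring(form, encoding="unicode") on the result serializes the form as text
that can be handed to a VoiceXML interpreter.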

6 Conclusion 

This paper describes the basic principles of the automatic generation of dialogue interfaces
used in a general dialogue interface for web-based applications. In future work we will try
to implement some principles of semantic analysis of HTML documents, which should improve
the descriptions of input fields. We will also try to optimize the structure of the VoiceXML
documents to obtain better dialogues.